DynamoRIO · edeiana · Jan 10, 2025 · Dec 16, 2024
diff --git a/clients/drcachesim/docs/drcachesim.dox.in b/clients/drcachesim/docs/drcachesim.dox.in
@@ -1010,48 +1010,109 @@ architecture to achieve optimal efficiency.  Google is sharing
 instruction and memory address traces from workloads running in Google
 data centers so that computer architecture researchers can study and
 develop new architecture ideas to improve the performance and
-efficiency of this important class of workloads.
+efficiency of this important class of workloads. To protect Google's
+intellectual property, these traces have been modified to filter out
+sensitive information. They follow a synthetic ISA (#DR_ISA_REGDEPS)
+that removes architecture-specific details (e.g., instruction opcodes)
+while still providing enough information (e.g., register dependencies)
+to perform meaningful analyses and simulations.
 
-\section sec_google_format Trace Format
+\section sec_google_format Public Trace Format
 
 The Google workload traces are captured using DynamoRIO's
 [drmemtrace](@ref page_drcachesim).  The traces are records of
 instruction and memory accesses as described at \ref
-sec_drcachesim_format.  We separate instruction and memory access
-records from each software thread into a separate file
-(.memtrace.gz). In addition, for each software thread, we also provide
-a branch_trace which contains execution data (taken/not taken, branch
-target) about each branch instruction (conditional, non-conditional,
-calls, etc.).  Finally, for each workload trace, we provide a thread
-statistics file (.threadstats.csv) which contains the thread ID (tid),
-instruction count, non-fetched instruction count (e.g. implicit
-instructions generated from microcode), load count, store count, and
-prefetch count.
+sec_drcachesim_format. While memory accesses are left unchanged
+compared to the original trace, instructions follow the
+#DR_ISA_REGDEPS synthetic ISA.
+
+#DR_ISA_REGDEPS is designed to preserve register dependencies and to give
+hints about the type of operation an instruction performs.
+
+Because #DR_ISA_REGDEPS is a synthetic ISA, some routines that work on instructions
+from an actual ISA (such as #DR_ISA_AMD64) are not supported (e.g., decode_sizeof()).
+
+Currently we support:
+- instr_convert_to_isa_regdeps(): to convert an #instr_t of an actual ISA to a
+  #DR_ISA_REGDEPS #instr_t.
+- instr_encode() and instr_encode_to_copy(): to encode a #DR_ISA_REGDEPS #instr_t
+  into a sequence of contiguous bytes.
+- decode() and decode_from_copy(): to decode an encoded #DR_ISA_REGDEPS instruction
+  into an #instr_t.
+
+A #DR_ISA_REGDEPS #instr_t contains the following information:
+- categories: composed of #dr_instr_category_t values, they indicate the type of
+  operation performed (e.g., a load, a store, a floating-point math operation, or a
+  branch).  Note that categories are composable, hence more than one category
+  can be set.  This information can be obtained using instr_get_category().
+- arithmetic flags: we don't distinguish between different flags; we only report
+  whether at least one arithmetic flag was read (all arithmetic flags will be set to
+  read) and/or written (all arithmetic flags will be set to written).  This
+  information can be obtained using instr_get_arith_flags().
+- number of source and destination operands: we only consider register operands.
+  This information can be obtained using instr_num_srcs() and instr_num_dsts().
+- source operation size: the size of the largest source operand the instruction
+  operates on.  This information can be obtained by accessing the #instr_t
+  operation_size field.
+- list of register operand identifiers: they are contained in #opnd_t lists,
+  separated into sources and destinations. Note that these #reg_id_t identifiers are
+  virtual, and it should not be assumed that they correspond to any DR_REG_ enum
+  value of any specific architecture. These identifiers are meant only for tracking
+  register dependencies with respect to other #DR_ISA_REGDEPS instructions.  These
+  lists can be obtained by walking the #instr_t operands with instr_get_dst() and
+  instr_get_src().
+- ISA mode: always #DR_ISA_REGDEPS.  This information can be obtained using
+  instr_get_isa_mode().
+- encoding bytes: an array of bytes containing the #DR_ISA_REGDEPS #instr_t
+  encoding.  Note that this information is present only for decoded instructions
+  (i.e., #instr_t generated by decode() or decode_from_copy()).  This information
+  can be obtained using instr_get_raw_bits().
+- length: the length of the encoded instruction in bytes.  Note that this
+  information is present only for decoded instructions (i.e., #instr_t generated by
+  decode() or decode_from_copy()).  This information can be obtained by accessing
+  the #instr_t length field.
+
+Note that all routines that operate on #instr_t and #opnd_t are also supported for
+#DR_ISA_REGDEPS instructions.  However, querying information other than that
+described above (e.g., the instruction opcode with instr_get_opcode()) will return
+the zeroed value set by instr_create() or instr_init() when the #instr_t was
+created (e.g., instr_get_opcode() would return OP_UNDECODED).
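
As a rough illustration of how the virtual register identifiers might be used for dependency tracking, the sketch below computes read-after-write dependencies over a sequence of instructions. It is not part of DynamoRIO: `RegdepsInstr` and `find_raw_dependencies` are hypothetical names, and the struct is a stand-in for register identifiers that real code would collect by walking the #instr_t operands with instr_get_src() and instr_get_dst().

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the register operands of one decoded
// DR_ISA_REGDEPS instruction (not a DynamoRIO type).
struct RegdepsInstr {
    std::vector<uint16_t> src_regs; // virtual register ids read
    std::vector<uint16_t> dst_regs; // virtual register ids written
};

// For each instruction, returns the indices of the earlier instructions that
// most recently wrote one of its source registers (read-after-write).
// A producer index may appear more than once if it feeds several sources.
std::vector<std::vector<size_t>>
find_raw_dependencies(const std::vector<RegdepsInstr> &instrs)
{
    std::unordered_map<uint16_t, size_t> last_writer; // reg id -> instr index
    std::vector<std::vector<size_t>> deps(instrs.size());
    for (size_t i = 0; i < instrs.size(); ++i) {
        // Look up producers before recording this instruction's own writes,
        // so an instruction never depends on itself.
        for (uint16_t reg : instrs[i].src_regs) {
            auto it = last_writer.find(reg);
            if (it != last_writer.end())
                deps[i].push_back(it->second);
        }
        for (uint16_t reg : instrs[i].dst_regs)
            last_writer[reg] = i;
    }
    return deps;
}
```

The identifiers are only compared with each other here, which matches the note above that they should not be interpreted as registers of any specific architecture.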
+
+In addition to instructions and memory accesses, traces also contain
+#dynamorio::drmemtrace::trace_marker_type_t markers.
+All markers of the original trace are present, except for the following, which have
+been removed:
+- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_SYSCALL_IDX
+- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_SYSCALL
+- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_SYSCALL_TRACE_START
+- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_SYSCALL_TRACE_END
+- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_SYSCALL_FAILED
+
+Because tracing overhead results in an inflated number of context switches, the
+#dynamorio::drmemtrace::TRACE_MARKER_TYPE_CPU_ID values have been modified to
+"unknown CPU" to avoid confusion. We recommend using our scheduler to obtain a
+realistic schedule of a trace's threads.
+
+Also, of the following function markers, only those related to SYS_futex functions
+are preserved:
+- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_FUNC_ID
+- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_FUNC_ARG
+- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_FUNC_RETVAL
+- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_FUNC_RETADDR
+
+Finally, every trace has an associated v2p.textproto file, which provides a
+plausible virtual-to-physical mapping of the virtual addresses present in the trace
+for more realistic TLB simulations. This is a static virtual-to-physical mapping
+with 2 MB pages. Users can generate different mappings (e.g., with a smaller page
+size) by modifying this file, or create their own mapping following the same
+v2p.textproto format.
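
To make the 2 MB page arithmetic concrete, the sketch below translates a virtual address through a static page mapping. It does not parse v2p.textproto, whose field layout is not described here; `PageMap` and `translate_2mb` are hypothetical names, and the in-memory map from virtual page base to physical page base is an assumption made for illustration only.

```cpp
#include <cstdint>
#include <unordered_map>

// 2 MB pages: the low 21 bits of an address are the offset within the page.
constexpr uint64_t kPageSize = 2ULL * 1024 * 1024;
constexpr uint64_t kOffsetMask = kPageSize - 1;

// Hypothetical in-memory mapping: virtual page base -> physical page base.
// Both bases are assumed to be aligned to kPageSize.
using PageMap = std::unordered_map<uint64_t, uint64_t>;

// Translates vaddr using the static mapping. Returns false if the virtual
// page is not mapped; otherwise stores the physical address in *paddr.
bool
translate_2mb(const PageMap &pages, uint64_t vaddr, uint64_t *paddr)
{
    const uint64_t vpage = vaddr & ~kOffsetMask; // Clear the page offset.
    auto it = pages.find(vpage);
    if (it == pages.end())
        return false;
    // Keep the page offset and swap in the physical page base.
    *paddr = it->second | (vaddr & kOffsetMask);
    return true;
}
```

Under these assumptions, switching to a smaller page size would only change `kPageSize` and require the map to be keyed by the smaller page bases.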
 
 \section sec_google_get Getting the Traces
 
 The Google Workload Traces can be downloaded from:
 
- - [Google workload trace folder](https://console.cloud.google.com/storage/browser/external-traces)
+ - [Google workload trace folder](TODO: add new link to Google Storage Bucket once known)
 
-Directory convention:
+Directory structure:
 - \verbatim
-  workload/trace-X/
-  \endverbatim
-  where X is sequential starting from 1
-
-Filename convention:
-- Memory trace file:
-  \verbatim
-  <uuid>.<tid>.memtrace.gz
-  \endverbatim
-- Branch trace file:
-  \verbatim
-  <uuid>.branch_trace.<tid>.csv.gz
-  \endverbatim
-- Thread statistics summary:
-  \verbatim
-  <uuid>.threadstats.csv
+  workload_name/
+    <uuid>.<tid>.drmemtrace.zip
+    v2p.textproto
   \endverbatim
 
 \section sec_google_help Getting Help and Reporting Bugs