i#6662 public traces: update doc #7139

Open: wants to merge 8 commits into base: master
Changes from 5 commits
135 changes: 104 additions & 31 deletions clients/drcachesim/docs/drcachesim.dox.in
@@ -998,7 +998,7 @@ $ bin64/drrun -t drmemtrace -indir newdir -tool basic_counts
\endcode

****************************************************************************
\page google_workload_traces Google Workload Traces
\page google_workload_traces Google Workload Traces (Version 2)

With the rapid growth of internet services and cloud computing,
workloads on warehouse-scale computers (WSCs) have become an important
@@ -1010,48 +1010,109 @@ architecture to achieve optimal efficiency. Google is sharing
instruction and memory address traces from workloads running in Google
data centers so that computer architecture researchers can study and
develop new architecture ideas to improve the performance and
efficiency of this important class of workloads.
efficiency of this important class of workloads. To protect Google's
intellectual property, these traces have had their original ISA replaced
with a synthetic ISA that we call #DR_ISA_REGDEPS. This synthetic ISA
removes architecture-specific details (e.g., the opcode of instructions),
while still providing enough information (e.g., register dependencies,
instruction categories) to perform meaningful analyses and simulations.

\section sec_google_format Trace Format
\section sec_google_format Public Trace Format

The Google workload traces are captured using DynamoRIO's
[drmemtrace](@ref page_drcachesim). The traces are records of
instruction and memory accesses as described at \ref
sec_drcachesim_format. We separate instruction and memory access
records from each software thread into a separate file
(.memtrace.gz). In addition, for each software thread, we also provide
a branch_trace which contains execution data (taken/not taken, branch
target) about each branch instruction (conditional, non-conditional,
calls, etc.). Finally, for each workload trace, we provide a thread
statistics file (.threadstats.csv) which contains the thread ID (tid),
instruction count, non-fetched instruction count (e.g. implicit
instructions generated from microcode), load count, store count, and
prefetch count.
sec_drcachesim_format. While memory accesses are left unchanged
compared to the original trace, instructions follow the
#DR_ISA_REGDEPS synthetic ISA.

The purpose of #DR_ISA_REGDEPS is to preserve register dependencies and to give
hints about the type of operation an instruction performs.

Because it is a synthetic ISA, some routines that work on instructions from an
actual ISA (such as #DR_ISA_AMD64) are not supported (e.g., decode_sizeof()).
We do support decode() and decode_from_copy(), which decode an encoded
#DR_ISA_REGDEPS instruction into an #instr_t.

A #DR_ISA_REGDEPS #instr_t contains the following information:
- Categories: composed of #dr_instr_category_t values, they indicate the type of
operation performed (e.g., a load, a store, a floating point math operation, a
branch, etc.). Note that categories are composable, hence more than one category
can be set. This information can be obtained using instr_get_category().
- Arithmetic flags: we don't distinguish between different flags; we only report whether
at least one arithmetic flag was read (all arithmetic flags will be set to read)
and/or written (all arithmetic flags will be set to written). This information
can be obtained using instr_get_arith_flags().
- Number of source and destination operands: we only consider register operands.
This information can be obtained using instr_num_srcs() and instr_num_dsts().
Memory operands can be deduced by subsequent read and write records in the trace.
- Source operation size: the size of the largest source operand the instruction operates on.
This information can be obtained using instr_get_operation_size().
- List of register operand identifiers: they are contained in #opnd_t lists,
separated in source and destination. Note that these #reg_id_t identifiers are
virtual and it should not be assumed that they belong to any DR_REG_ enum value
of any specific architecture. These identifiers are meant for tracking register
dependencies with respect to other #DR_ISA_REGDEPS instructions only. These
lists can be obtained by walking the #instr_t operands with instr_get_dst() and
instr_get_src().
- ISA mode: always #DR_ISA_REGDEPS. This information can be obtained using
instr_get_isa_mode().
- Encoding bytes: an array of bytes containing the #DR_ISA_REGDEPS #instr_t
encoding. Note that this information is present only for decoded instructions
(i.e., #instr_t generated by decode() or decode_from_copy()). This information
can be obtained using instr_get_raw_bits().
- Length: the length of the encoded instruction in bytes. Note that this
information is present only for decoded instructions (i.e., #instr_t generated by
decode() or decode_from_copy()). This information can be obtained using instr_length().
Be aware that in Google Workload Traces the instruction fetch size of a
#dynamorio::drmemtrace::memref_t and the instr_length() of the corresponding fetched
instruction do not match. For convenience, we kept the instruction fetch size
the same as the size of the original ISA instruction.
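
One practical consequence of the fetch size note above can be sketched as follows. This is an illustrative sketch only: `FetchRecord` is a hypothetical, simplified stand-in for an instruction fetch record and is not the real #dynamorio::drmemtrace::memref_t layout. It assumes that program counters in the trace still follow the original ISA layout, so the fall-through address is computed from the fetch size stored in the trace rather than from the encoded #DR_ISA_REGDEPS length.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for an instruction fetch record (not the real memref_t).
struct FetchRecord {
    uint64_t pc;       // Address of the fetched instruction.
    size_t fetch_size; // Size of the original ISA instruction, as kept in the trace.
};

// Address of the next sequential instruction, computed from the fetch size
// stored in the trace (not from the length of the synthetic encoding).
inline uint64_t
fall_through_pc(const FetchRecord &rec)
{
    return rec.pc + rec.fetch_size;
}

// True if 'next' immediately follows 'prev' in the original address space,
// i.e., no taken branch or other control transfer happened in between.
inline bool
is_sequential(const FetchRecord &prev, const FetchRecord &next)
{
    return next.pc == fall_through_pc(prev);
}
```

A tool that detects taken branches by comparing consecutive fetch addresses would use a check like `is_sequential()`; substituting the encoded length for the fetch size would likely misclassify sequential instructions as control transfers.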

Note that all routines that operate on #instr_t and #opnd_t are also supported for
#DR_ISA_REGDEPS instructions and their operands. However, querying information outside
of those described above (e.g., the instruction opcode with instr_get_opcode()) will
return the zeroed value set by instr_create() or instr_init() when the #instr_t was
created.
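
As an illustration of how the virtual register identifiers can be used, the following is a minimal sketch of read-after-write dependency tracking. It does not use the DynamoRIO API: `RegDepsInstr` and `DependencyTracker` are hypothetical names, and the source and destination vectors stand in for the register identifiers one would collect by walking a real #instr_t with instr_get_src() and instr_get_dst().

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in holding only the virtual register ids of an instruction.
struct RegDepsInstr {
    std::vector<uint16_t> srcs; // Virtual ids of source registers.
    std::vector<uint16_t> dsts; // Virtual ids of destination registers.
};

// Tracks, for each virtual register, the index of the last instruction that wrote it.
class DependencyTracker {
public:
    // Returns the indices of earlier instructions whose results this instruction
    // reads (read-after-write), then records this instruction as the last writer
    // of each of its destination registers.
    std::vector<size_t>
    add(const RegDepsInstr &instr)
    {
        std::vector<size_t> producers;
        for (uint16_t reg : instr.srcs) {
            auto it = last_writer_.find(reg);
            if (it == last_writer_.end())
                continue; // Value was produced before tracking started.
            if (std::find(producers.begin(), producers.end(), it->second) ==
                producers.end())
                producers.push_back(it->second);
        }
        for (uint16_t reg : instr.dsts)
            last_writer_[reg] = next_index_;
        ++next_index_;
        return producers;
    }

private:
    std::unordered_map<uint16_t, size_t> last_writer_;
    size_t next_index_ = 0;
};
```

Because the identifiers are only meaningful relative to other #DR_ISA_REGDEPS instructions, the sketch treats them as opaque keys and never maps them to a DR_REG_ value.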

On top of instructions and memory accesses, traces also have
#dynamorio::drmemtrace::trace_marker_type_t markers.
All markers of the original trace are present, except for the following,
which have been removed:
- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_SYSCALL_IDX
- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_SYSCALL
- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_SYSCALL_TRACE_START
- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_SYSCALL_TRACE_END
- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_SYSCALL_FAILED
Because tracing overhead results in inflated context switches, the
#dynamorio::drmemtrace::TRACE_MARKER_TYPE_CPU_ID values have been modified to
"unknown CPU" to avoid confusion. We recommend that users use our scheduler
(see \ref sec_drcachesim_sched) for a realistic schedule of a trace's threads.
Also, we preserved the following markers, but only for SYS_futex functions:
- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_FUNC_ID
- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_FUNC_ARG
- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_FUNC_RETVAL
- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_FUNC_RETADDR

Finally, every trace has an associated v2p.textproto file, which provides a
plausible virtual to physical mapping of the virtual addresses present in a trace
for more realistic TLB simulations. This is a static virtual to physical mapping
with 2 MB pages. Users can generate different mappings (e.g., smaller page size)
by modifying this file, or create their own mapping following the same
v2p.textproto format.
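
To make the idea of a static mapping with 2 MB pages concrete, here is a minimal sketch of the address arithmetic involved. It does not parse v2p.textproto and makes no claim about that file's fields: `StaticPageMap`, `map_page()`, and `translate()` are hypothetical names, and the mapping entries are assumed to have been loaded by some other code.

```cpp
#include <cstdint>
#include <unordered_map>

// Page size used by the static mapping described above: 2 MB.
constexpr uint64_t kPageSize = 2ULL * 1024 * 1024;

// Hypothetical in-memory form of a static virtual-to-physical page mapping.
class StaticPageMap {
public:
    // Registers one mapping; both addresses are expected to be page-aligned.
    void
    map_page(uint64_t virt_page_base, uint64_t phys_page_base)
    {
        pages_[virt_page_base / kPageSize] = phys_page_base / kPageSize;
    }

    // Translates a virtual address. Returns false if its page is not mapped;
    // otherwise stores the physical address in *paddr and returns true.
    bool
    translate(uint64_t vaddr, uint64_t *paddr) const
    {
        auto it = pages_.find(vaddr / kPageSize);
        if (it == pages_.end())
            return false;
        // The offset within the page is preserved by the translation.
        *paddr = it->second * kPageSize + vaddr % kPageSize;
        return true;
    }

private:
    std::unordered_map<uint64_t, uint64_t> pages_; // Virtual page -> physical page.
};
```

Experimenting with a smaller page size in a sketch like this amounts to changing `kPageSize` and supplying mappings at the new granularity.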

\section sec_google_get Getting the Traces

The Google Workload Traces can be downloaded from:

- [Google workload trace folder](https://console.cloud.google.com/storage/browser/external-traces)
- [Google workload trace folder](TODO: add new link to Google Storage Bucket once known)
Contributor:
I assume we'd use the same bucket? Should we make version_1 and version_2 subdirs? Or just delete v1? Maybe just delete.

Contributor Author:
Mhm, I'd keep version 1 in that bucket as is.
I added the deprecated section that says DynamoRIO 11.0 is the last version that will support these traces, so if somebody is still using them, they can still do so with DynamoRIO 11.0.
Can't we use a new bucket? "external_traces_v2"? Do we need a subdir?

Contributor:
Let's discuss offline.

Contributor:
The new one could be sthg like console.cloud.google.com/storage/browser/external-traces-v2.

Contributor Author:
That's right, I created: https://console.cloud.google.com/storage/browser/external-traces-v2 (now empty).
I've added it as link to the new public traces.
I also added a section about deprecated public v1 traces with the old link to them.

I think it might be a good idea to also have DOI for citations? We could make one with https://zenodo.org/ . This way it's easy to track who/how many cite this work.


Directory convention:
Directory structure:
- \verbatim
workload/trace-X/
\endverbatim
where X is sequential starting from 1

Filename convention:
- Memory trace file:
\verbatim
<uuid>.<tid>.memtrace.gz
\endverbatim
- Branch trace file:
\verbatim
<uuid>.branch_trace.<tid>.csv.gz
\endverbatim
- Thread statistics summary:
\verbatim
<uuid>.threadstats.csv
workload_name/
<uuid>.<tid>.memtrace.zip
v2p.textproto
\endverbatim

\section sec_google_help Getting Help and Reporting Bugs
@@ -1087,6 +1148,18 @@ You can contribute to the project in many ways:
- Sharing and collaborating on architecture research.
- Reporting issues: see \ref sec_google_help

\section sec_public_v1_deprecated Deprecated Google Workload Traces (Version 1)

The previous version of Google workload traces contains a subset of the
information of the current traces and has been deprecated.
Please use the current version described above.

The previous version can still be found at:

- [Google workload trace folder (Version 1)](https://console.cloud.google.com/storage/browser/external-traces)

DynamoRIO 11.0 is the last version that supports these traces.

****************************************************************************
\page sec_drcachesim_config_file Configuration File
