Let's start by spinning up the snitch-toolchain Docker container, mounting a clone of this repo into it at `/src`, and running the whole test suite:
```
$ git clone --recursive https://github.com/opencompl/riscv-paper-experiments.git
$ docker run --rm -ti --volume $PWD/riscv-paper-experiments:/src ghcr.io/opencompl/snitch-toolchain:latest bash
$ cd /src
$ make
```
The same can be done in one shot:
```
$ git clone --recursive https://github.com/opencompl/riscv-paper-experiments.git
$ docker run -ti --volume ${PWD}/riscv-paper-experiments:/src ghcr.io/opencompl/snitch-toolchain:latest bash -c "make -C /src"
```
This builds the kernels, executes them with Verilator, processes the traces from these runs and plots the results.
The overall results are collated in the `pivoted.csv`, `pivoted_ipc.csv`, and `pivoted_fpu.csv` files. Individual CSV files in each kernel directory contain the results in cycles for each version of the kernel.
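For instance, to inspect the outputs (a sketch assuming the collated files land in the repo root and the per-kernel directory layout used throughout this README):

```
$ ls pivoted*.csv                # collated results
$ ls kernels/saxpy/64xf32/*.csv  # per-kernel results in cycles
```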
Note: if on macOS, be aware that the official Docker app is trash; OrbStack is highly recommended instead.
Note: if you're running Docker on an architecture other than x86_64 (e.g.: Apple Silicon), it's likely that your `docker run` command will complain about the image being `linux/amd64`. Add the following option to explicitly ask for a specific platform:

```
$ docker run --platform linux/amd64 ...
```
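For example, the interactive command from above becomes:

```
$ docker run --platform linux/amd64 --rm -ti --volume $PWD/riscv-paper-experiments:/src ghcr.io/opencompl/snitch-toolchain:latest bash
```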
Note: we use Snakemake instead of `make` to manage our test suite. The test workflow is defined in the `Snakefile`.
To build RISC-V executables of specific tests, once inside the Docker container:
```
$ cd /src
$ snakemake --cores=1 kernels/saxpy/64xf32/linalg.x
$ ls kernels/saxpy/64xf32/*.x # linalg.x should exist if all went OK
```
Note: since Snakemake schedules jobs based on the amount of resources available to the build process, it requires you to explicitly specify the number of cores you want to use; if you don't, it will complain.
The `Snakefile` performs the following steps (sketched as plain shell commands after this list):

- `.mlir` -> `.ll.mlir` (a.k.a. MLIR source that uses the LLVM dialect only) via `mlir-opt`
- `.ll.mlir` -> `.ll` via `mlir-translate`
- `.ll` -> `.ll12` (a.k.a. LLVM IR backported to a form compatible with LLVM 12) via `tollvm12.py`
- `.ll12` -> `.o` via `clang`
- `.o` -> `.x` via `lld`, linking the runtime and using the correct linker script to lay out the ELF file.
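For illustration only, here's a minimal sketch of those stages as standalone shell commands; the pass list, tool flags, file names, and the `tollvm12.py` invocation are assumptions on our part, and the authoritative versions live in the `Snakefile`:

```
# Hypothetical sketch; the actual passes and flags are defined in the Snakefile.
$ mlir-opt --convert-func-to-llvm --finalize-memref-to-llvm \
    --reconcile-unrealized-casts kernel.mlir -o kernel.ll.mlir  # LLVM dialect only
$ mlir-translate --mlir-to-llvmir kernel.ll.mlir -o kernel.ll   # emit LLVM IR
$ python3 tollvm12.py kernel.ll > kernel.ll12                   # backport to LLVM 12
$ clang -x ir --target=riscv32-unknown-elf -c kernel.ll12 -o kernel.o
$ ld.lld -T link.ld runtime.o kernel.o -o kernel.x              # link the final ELF (names hypothetical)
```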
In visual terms, we can look at an overview of the overall build process for the whole test suite:
Once the ELF executable is ready, we can simulate its execution on a Snitch cluster via the RTL simulator generated by Verilator:
```
$ cd /src
$ snakemake --cores=1 kernels/saxpy/64xf32/linalg.profile.json
```
The previous command should produce the following output upon execution:
```
Wrote 36 bytes of bootrom to 0x1000
Wrote entry point 0x80000000 to bootloader slot 0x1020
Wrote 38 bytes of bootdata to 0x1024
[Tracer] Logging Hart 8 to logs/trace_hart_00000008.dasm
[Tracer] Logging Hart 0 to logs/trace_hart_00000000.dasm
[Tracer] Logging Hart 1 to logs/trace_hart_00000001.dasm
[Tracer] Logging Hart 2 to logs/trace_hart_00000002.dasm
[Tracer] Logging Hart 3 to logs/trace_hart_00000003.dasm
[Tracer] Logging Hart 4 to logs/trace_hart_00000004.dasm
[Tracer] Logging Hart 5 to logs/trace_hart_00000005.dasm
[Tracer] Logging Hart 6 to logs/trace_hart_00000006.dasm
[Tracer] Logging Hart 7 to logs/trace_hart_00000007.dasm
```
Correct execution should return 0 when issuing `echo $?`.
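A minimal check, assuming the simulated binary's exit status propagates through the Snakemake rule that drives the run:

```
$ snakemake --cores=1 kernels/saxpy/64xf32/linalg.profile.json
$ echo $?
0
```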
Note: while the `main` function is run by all the compute cores in the cluster, the current startup code returns the integer return value of core no. 0 only; return values of cores other than no. 0 are discarded.
Along with actual profiling results, the simulation produces execution traces:
```
$ ls kernels/saxpy/64xf32/linalg.logs/
kernels/saxpy/64xf32/linalg.logs/trace_hart_00000000.trace.txt # decoded trace
kernels/saxpy/64xf32/linalg.logs/trace_hart_00000000.trace.json # json performance data per section
```
Note: the current version of `spike-dasm` included in the Docker image doesn't support Snitch custom instructions: while side effects are reported correctly, they show up as `unknown` in the decoded trace.
For each `snrt_mcycle()` call, the cycle timeline for the core is split in two regions, before and after the call. In our test runner, where we call `snrt_mcycle()` right before and after the measured kernel, we have the following sequence of sections, with the middle one being the one related to the kernel itself:
```
$ cat kernels/saxpy/64xf32/linalg.logs/trace_hart_00000000.trace.json
Performance metrics for section 0 @ (12, 719):
# ...
Performance metrics for section 1 @ (721, 1448):
tstart 0.0
fpss_loads 129
fpss_stores 64
tend 0.0
snitch_loads 0
snitch_avg_load_latency 0
snitch_occupancy 0.4332
snitch_fseq_rel_offloads 0.4923
fseq_yield 1.0
fseq_fpu_yield 1.0
fpss_section_latency 36
fpss_avg_fpu_latency 3.1736
fpss_avg_load_latency 1.0775
fpss_occupancy 0.4202
fpss_fpu_occupancy 0.1584
fpss_fpu_rel_occupancy 0.3769
cycles 764
total_ipc 0.8534
Performance metrics for section 2 @ (1450, 3227):
# ...
```
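As a quick sanity check, the sections can be counted by grepping for their headers; two `snrt_mcycle()` calls yield the three sections shown above:

```
$ grep -c 'Performance metrics' kernels/saxpy/64xf32/linalg.logs/trace_hart_00000000.trace.json
3
```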
Along with performance stats, the execution trace shows each instruction's dispatch and retirement along the core's cycle timeline:
```
$ grep -B2 -A3 1322 kernels/saxpy/64xf32/linalg.logs/trace_hart_00000000.trace.txt
1320 M 0x800001e0 flw ft0, 0(a5) #; ft0 <~~ Word[0x100000c4]
1321 M 0x800001e8 flw ft1, 0(a5) #; ft1 <~~ Word[0x100001c4], (f:lsu) ft0 <-- 140.3935394
1322 M 0x800001ec fmul.s ft0, ft0, fa0 #; ft0 = 140.3935394, fa0 = 97.6270065, (f:lsu) ft1 <-- 627.5956421
1323 M 0x800001f4 add a5, a2, a3 #; a2 = 0x10000200, a3 = 208, (wrb) a5 <-- 0x100002d0
1324 M 0x800001f8 addi a3, a3, 4 #; a3 = 208, (wrb) a3 <-- 212
1325 M #; (f:fpu) ft0 <-- 13706.2011719
```
From the decoded trace we can see that we are spending:

- 2 cycles to load `rd`/`rs1` from L1 (cycles `1320:1321`, the constant was already materialized in `fa0`)
- 4 cycles to perform the actual multiplication (cycles `1322:1325`)
We can also check that the core is doing what we expect:
```
$ grep fmul\.s kernels/saxpy/64xf32/linalg.logs/trace_hart_00000000.trace.txt | wc -l
64
$ grep fmul\.s kernels/saxpy/64xf32/linalg.logs/trace_hart_00000001.trace.txt | wc -l
0
```
The core (a.k.a. hart in RISC-V terminology) no. 0 was the only one actually executing the kernel, while all of the other cores did nothing, as they early-return from the `main` function.
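To run the same check across every hart at once (a throwaway one-liner over the trace files produced above):

```
$ for t in kernels/saxpy/64xf32/linalg.logs/trace_hart_*.trace.txt; do \
    printf '%s: ' "$t"; grep -c 'fmul\.s' "$t"; done
```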
The root `Makefile` provides two different targets:

- `fast` (default target): completes as quickly as possible by running only 1 shape for each kernel. The CI runs this test set;
- `all`: runs everything that is needed for paper plots. Warning: ~1300 Verilator simulations will be run, so make sure to be on a beefy machine.
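Concretely, since `fast` is the default target:

```
$ make        # same as `make fast`: 1 shape per kernel
$ make all    # full paper run: ~1300 Verilator simulations
```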
On multi-user systems (e.g.: HPC clusters) where Docker is not available, Singularity is usually provided as an almost drop-in replacement. To run the experiments harness:
```
$ singularity pull snitch-toolchain.sif docker://ghcr.io/opencompl/snitch-toolchain:latest
$ singularity run --bind ${PWD}/riscv-paper-experiments:/src snitch-toolchain.sif bash -c "make -C /src"
```
Alongside execution traces, performance reports are produced in the form of `json` files summing up measurements for the Verilator run at hand.
Each counter refers to a specific scope in the Snitch cluster micro-architecture:
All the counters provided for each measurement section are listed below.
| Counter | Unit | Scope | Description |
|---|---|---|---|
| `tstart` | cycles | cc | The global simulation time when the `mcycle` instruction opening the current measurement region is issued |
| `tend` | cycles | cc | The global simulation time when the `mcycle` instruction closing the current measurement region is issued |
| `start` | cycles | cc | The core complex cycle count when the `mcycle` instruction opening the current measurement region is issued |
| `end` | cycles | cc | The core complex cycle count when the `mcycle` instruction closing the current measurement region is issued |
| `end_fpss` | cycles | cc > fpss | The core complex cycle count when the last FP operation issued in the current measurement region retires |
| `snitch_issues` | inst | cc > snitch | Total number of instructions issued by Snitch, excluding those offloaded to the FPSS (see `snitch_fseq_offloads`) |
| `snitch_occupancy` | inst/cycle | cc > snitch | IPC of the Snitch core, calculated as `snitch_issues / cycles` |
| `snitch_fseq_offloads` | inst | cc > snitch | No. of instructions offloaded by the Snitch to the FPSS |
| `snitch_fseq_rel_offloads` | % | cc > snitch | The ratio between `snitch_fseq_offloads` and the total number of instructions issued by the Snitch core proper, i.e. `snitch_issues + snitch_fseq_offloads` |
| `snitch_load_latency` | cycles | cc > snitch | Cumulative latency of all loads issued by Snitch's own LSU. The latency of a load is measured from the cycle the load is issued to the cycle it is retired, i.e. it writes back to the register file. See `fpss_load_latency` |
| `snitch_avg_load_latency` | cycles | cc > snitch | Average latency of a load issued by Snitch's own LSU (see `snitch_load_latency`) |
| `snitch_loads` | inst | cc > snitch | No. of load instructions retired by Snitch's own LSU |
| `fseq_yield` | % | cc > fseq | The ratio between `fpss_issues` and `snitch_fseq_offloads`. The difference lies in the FREP sequencer possibly replicating instructions. If the sequencer is not used this ratio should amount to 1 |
| `fseq_fpu_yield` | % | cc > fseq | FIXME Currently identical to `fseq_yield`, probably a bug in the trace postprocessor. Most likely originally intended to be the ratio between `fpss_fpu_issues` and `snitch_fseq_offloads` |
| `fpss_issues` | inst | cc > fpss | Total number of instructions issued by the FPSS. It counts repeated issues from the FREP sequencer |
| `fpss_fpu_issues` | inst | cc > fpss > fpu | Similar to `fpss_issues`, but counts only instructions destined for the FPU proper. It does not, for instance, include instructions issued to the FPSS's own LSU |
| `fpss_fpu_latency` | cycles | cc > fpss > fpu | Cumulative latency of all FPU instructions. The latency of an FPU instruction is measured from the cycle the instruction is issued to the cycle it is retired, i.e. it writes back to the register file |
| `fpss_avg_fpu_latency` | cycles | cc > fpss > fpu | Average latency of an FPU instruction (see `fpss_fpu_latency`) |
| `fpss_load_latency` | cycles | cc > fpss | Cumulative latency of all loads issued by the FPSS's own LSU. The latency of a load is measured from the cycle the load is issued to the cycle it is retired, i.e. it writes back to the register file. See `snitch_load_latency` |
| `fpss_avg_load_latency` | cycles | cc > fpss | Average latency of a load issued by the FPSS's own LSU (see `fpss_load_latency`) |
| `fpss_loads` | inst | cc > fpss | No. of load instructions retired by the FPSS's own LSU |
| `fpss_section_latency` | cycles | cc > fpss | `max(end_fpss - end, 0)` |
| `fpss_occupancy` | inst/cycle | cc > fpss | IPC of the FPSS, calculated as `fpss_issues / cycles` |
| `fpss_fpu_occupancy` | inst/cycle | cc > fpss > fpu | IPC of the FPU, calculated as `fpss_fpu_issues / cycles` |
| `fpss_fpu_rel_occupancy` | % | cc > fpss > fpu | The ratio between `fpss_fpu_occupancy` and `fpss_occupancy`; equals `fpss_fpu_issues / fpss_issues` |
| `cycles` | cycles | cc | Overall cycles spent in the current measurement region, calculated as `max(end, end_fpss) - start + 1` |
| `total_ipc` | inst/cycle | cc | The overall IPC of the core complex, calculated as `snitch_occupancy + fpss_occupancy` |
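As a worked check of these formulas against section 1 of the trace shown earlier: `snitch_occupancy + fpss_occupancy = 0.4332 + 0.4202 = 0.8534`, which is exactly the reported `total_ipc`; likewise `fpss_fpu_occupancy / fpss_occupancy = 0.1584 / 0.4202 ≈ 0.3769`, matching the reported `fpss_fpu_rel_occupancy` up to rounding of the printed values.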