- Prerequisites, Installing and Building the GPGPU-Sim v4.2
  - 1.1. Installing prerequisite libraries and simulator
  - 1.2. Building simulator with doxygen files
- Tracking Runtime IPC, Instruction Monitoring, and Memory Accesses on L1D, L2, and DRAM
- Tracking Runtime Power Consumption of GPU and Sub-components
- Visualizing Power Consumption, Memory Accesses, and Streaming Multiprocessor Metrics
Detailed documentation on GPGPU-Sim, what GPU models and architectures exist, how to configure it, and a guide to the source code can be found here. Also, detailed documentation on AccelWattch to collect power consumption metrics for subcomponents and a guide to the source code can be found here.
GPGPU-Sim dependencies:
gcc, g++, make, makedepend, xutils, bison, flex, zlib, CUDA Toolkit
(optional) GPGPU-Sim documentation dependencies: doxygen, graphviz
(optional) AerialVision dependencies: python-pmw, python-ply, python-numpy, libpng12-dev, python-matplotlib
CUDA SDK dependencies: libxi-dev, libxmu-dev, libglut3-dev
After installing the prerequisite libraries, clone the AccelWattch implementation of the simulator (GPGPU-Sim v4.2). Then follow the commands below inside the simulator directory to build the simulator.
user@GPPRMon:~$ source setup_environment <build_type>
# This command sets the environment variables so that the simulator's executables can be found on the linkage path.
# If you want to debug the simulator (it is written in C/C++), specify build_type as `debug`.
# Otherwise, leave build_type empty; the executables are automatically built as the `release` version.
user@GPPRMon:~$ make #To compile source files, create and link the executable files of the simulator.
user@GPPRMon:~$ make clean #To clean the simulator executables
Moreover, if you want to generate the documentation files whose dependencies are listed as optional above, you must first install those dependencies. Afterward, you can obtain the docs with
user@GPPRMon:~$ make docs # Generates doxygen files describing simulator elements
user@GPPRMon:~$ make cleandocs # Deletes pre-generated doxygen files if they exist.
The doxygen-generated documentation eases understanding of the simulator's classes, templates, functions, etc.
During the simulation, the simulator creates memory access information in the below path.
user@GPPRMon/runtime_profiling_metrics/memory_accesses:~$
To enable memory access metric collection, one must specify the flags below in the gpgpusim.config file.
Flags | Descriptions | Default value
---|---|---
`-mem_profiler` | Enables collection of memory access metrics | 0 = off
`-mem_runtime_stat` | Sampling frequency for the metric collection | 100 (record every 100 GPU cycles)
`-IPC_per_prof_interval` | Records IPC rates for each metric collection sample | 0 = do not collect
`-instruction_monitor` | Records issue/completion statistics of the instructions | 0 = do not collect
`-L1D_metrics` | Records metrics for L1D cache accesses | 0 = do not collect
`-L2_metrics` | Records metrics for L2 cache accesses | 0 = do not collect
`-DRAM_metrics` | Records metrics for DRAM accesses | 0 = do not collect
`-store_enable` | Records metrics for both store and load instructions | 0 = record metrics only for loads
`-accumulate_stats` | Accumulates collected metrics | 0 = do not accumulate
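As an illustration, a gpgpusim.config fragment that turns on the memory profiling flags above might look like the following (the values are examples, not tuned recommendations):

```
# Illustrative gpgpusim.config fragment: enable the memory profiler,
# sample every 100 GPU cycles, and record L1D/L2/DRAM access metrics.
-mem_profiler 1
-mem_runtime_stat 100
-IPC_per_prof_interval 1
-instruction_monitor 1
-L1D_metrics 1
-L2_metrics 1
-DRAM_metrics 1
-store_enable 1
-accumulate_stats 0
```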
During simulation, the simulator records power consumption metrics in the below path.
user@GPPRMon/runtime_profiling_metrics/energy_consumption:~$
The simulator will create separate folders and power profiling metrics for each kernel at runtime. For now, the power consumption metrics below are supported, but these metrics may be enhanced further to investigate sub-units independently.
- GPU
  - Core
    - Execution Unit (Register File, Schedulers, Functional Units, etc.)
    - Load Store Unit (Crossbar, Shared Memory, Shared Memory Miss/Fill Buffer, Cache, Cache Prefetch Buffer, Cache WriteBack Buffer, Cache Miss Buffer, etc.)
    - Instruction Functional Unit (Instruction Cache, Branch Target Buffer, Decoder, Branch Predictor, etc.)
  - Network on Chip
  - L2 Cache
  - DRAM + Memory Controller
    - Frontend Engine
    - PHY between Memory Controller and DRAM
    - Transaction Engine (Backend Engine)
    - DRAM
Flags | Descriptions | Default value
---|---|---
`-power_simulation_enabled` | Enables collection of power consumption metrics | 0 = off
`-gpgpu_runtime_stat` | Sampling frequency in terms of GPU cycles | 1000 cycles
`-power_per_cycle_dump` | Dumps detailed power output for each sample | 0 = off
`-dvfs_enabled` | Turns dynamic voltage/frequency scaling on or off for the power model | 0 = not enabled
`-aggregate_power_stats` | Accumulates power statistics across kernels | 0 = do not aggregate
`-steady_power_levels_enabled` | Produces a file for the steady power levels | 0 = off
`-steady_state_definition` | Allowed deviation : number of samples | 8:4
`-power_trace_enabled` | Produces a file for the power trace | 0 = off
`-power_trace_zlevel` | Compression level of the power trace output log | 6 (0 = no compression, 9 = highest)
`-power_simulation_mode` | Switches the performance counter input for power simulation | 0 (0 = Sim, 1 = HW, 2 = HW-Sim Hybrid)
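For illustration, a gpgpusim.config fragment that enables the AccelWattch power profiling flags above might look like this (the values are examples only):

```
# Illustrative gpgpusim.config fragment: enable power simulation with
# simulator-driven performance counters, sampling every 1000 GPU cycles.
-power_simulation_enabled 1
-power_simulation_mode 0
-gpgpu_runtime_stat 1000
-dvfs_enabled 1
-power_trace_enabled 1
-power_trace_zlevel 6
-steady_power_levels_enabled 0
-steady_state_definition 8:4
```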
Our visualizer tool takes the .csv files produced during runtime simulation of a GPU kernel and generates three different visualization schemes. The simulator currently supports the GTX480_FERMI, QV100_VOLTA, RTX2060S_TURING, RTX2060_TURING, RTX3070_AMPERE, TITAN_KEPLER, TITANV_VOLTA, and TITANX_PASCAL GPUs. As each GPU has a different memory hierarchy, we designed a separate scheme for each hierarchy. However, the SM and GPU visualizations share one design that applies to every GPU.
- A CTA's instruction issue/completion, power consumption of the corresponding SM, and L1D usage of that SM.
The first visualization displays the instructions of the 1st CTA, which is mapped onto the 1st SM. The PC column shows each instruction's PC, the Opcode column shows the operation codes of the instructions of the 1st thread block, and the Operands column shows the register IDs for the corresponding opcode.
In the rightmost column (ISSUE/COMPLETION), the visualizer displays the issue and completion information of the instructions for each warp in the first and second rows, respectively. For example, the first instruction, cvta.to.global.u64, whose PC is 656, is issued at the 55557th cycle by the 7th warp and completed at the 55563rd cycle.
This scheme shows a CTA's issued and completed instructions within a predetermined cycle interval. For the above example, this interval is [55500, 56000).
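The arithmetic behind reading this view can be sketched as follows; the function names are illustrative, not part of GPPRMon:

```python
# Hypothetical sketch: deriving per-instruction latency from the issue and
# completion cycles shown in the ISSUE/COMPLETION column.
def instruction_latency(issue_cycle, completion_cycle):
    """Cycles elapsed between issue and completion of one instruction."""
    return completion_cycle - issue_cycle

def in_interval(cycle, start, end):
    """True if a cycle falls in the half-open observation window [start, end)."""
    return start <= cycle < end

# The cvta.to.global.u64 example above: issued at 55557, completed at 55563.
lat = instruction_latency(55557, 55563)  # -> 6 cycles
assert in_interval(55557, 55500, 56000)  # falls inside the [55500, 56000) window
```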
In addition, one may see the L1D cache usage and the runtime power consumption of the SM's subcomponents. The RunTimeDynm parameter represents the total power consumed by each section. The execution, functional, and load/store units, together with the idle core, are the main contributors to an SM's power consumption. IPC per SM is displayed at the bottom.
- Access information on the memory units and power consumption of memory controller + DRAM units.
The second visualization shows the accesses to the L1D and L2 caches (as hits, hit_reserved_status, misses, reservation_failures, sector_misses, and mshr_hits) and to the DRAM partitions (as row buffer hits and row buffer misses) within the sampling interval. For caches, the access statuses are as follows:
- Hits: Data is found in the corresponding sector of the line.
- Hit Reserved: The line is allocated for the data, but the data has not arrived yet. The data will be placed in the corresponding line and sector.
- Misses: A miss triggers a cache line eviction. The line to evict is determined with a dirty counter; the replacement policy is set via the cache configuration.
- Reservation Failures: Whenever an access cannot find the data in the cache, it tries to create a miss request in the MSHR buffer. When there is no slot left to hold the corresponding miss, the memory pipeline stalls, and the access status is a reservation failure.
- Sector Misses: When the data is not found in the looked-up sector of the cache line, the access status is a sector miss.
- MSHR Hits: When the data is not found in the looked-up sector of the cache line and a miss request for it is already present in the MSHR buffer, the access is recorded as an MSHR hit.
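The statuses above can be aggregated into a simple summary metric. The sketch below is illustrative only (the status strings and the hit-rate definition are assumptions, not GPPRMon output format):

```python
# Illustrative sketch: folding per-interval cache access statuses into a hit
# rate. Reservation failures stall the pipeline rather than complete a lookup,
# so they are excluded from the denominator here.
from collections import Counter

def l1d_hit_rate(statuses):
    """statuses: iterable of access-status strings recorded in one interval."""
    c = Counter(statuses)
    hits = c["hit"] + c["hit_reserved"]  # data present, or line already allocated
    lookups = hits + c["miss"] + c["sector_miss"] + c["mshr_hit"]
    return hits / lookups if lookups else 0.0
```

For example, `l1d_hit_rate(["hit", "hit", "miss", "sector_miss"])` evaluates to 0.5.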
For DRAM, access descriptions are as follows:
- Row Buffer Hits: The data requested by the current instruction exists in the DRAM row buffer, which holds the row of the most recent access.
- Row Buffer Misses: The data requested by the current instruction does not exist in the row buffer.
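Row-buffer locality follows the same pattern; a minimal sketch under assumed counter names (not the simulator's output format):

```python
# Illustrative sketch: row-buffer locality from the DRAM counters above.
def row_buffer_hit_rate(row_hits, row_misses):
    """Fraction of DRAM accesses served from the open row buffer."""
    total = row_hits + row_misses
    return row_hits / total if total else 0.0
```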
- GPU Throughput and Power Consumption
GPUs mainly consist of SMs, which include functional units, register files and caches, NoCs, and memory partitions in which DRAM banks and L2 caches exist. For the configured architectures, the number of L1D caches is equal to SMs (SIMT Core Clusters), the number of DRAM banks is equal to the number of memory partitions, and the number of L2 caches is equal to twice the number of memory partitions.
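The unit-count relations stated above can be written down directly; the function and parameter names here are hypothetical:

```python
# Illustrative sketch of the unit counts for the configured architectures:
# one L1D per SIMT core cluster, one DRAM bank per memory partition, and
# two L2 cache sub-partitions per memory partition.
def unit_counts(num_simt_clusters, num_mem_partitions):
    return {
        "l1d_caches": num_simt_clusters,
        "dram_banks": num_mem_partitions,
        "l2_caches": 2 * num_mem_partitions,
    }
```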
The third visualization shows the average L1D, L2 cache, and DRAM access statistics in the Memory Usage Metrics, the average IPC among active SMs, and the Power Consumption Metrics of the NoCs, the memory partitions (L2 caches and MC+DRAM), and the SMs.
In addition to the above runtime visualization options, we provide a display option for the average runtime memory access statistics and IPC vs power dissipation among the units below.
To obtain average runtime memory access statistics and IPC vs power dissipation:
user@GPPRMon/runtime_visualizer:~$ python3 average_disp.py param1 param2 param3 param4
- param1 -> Kernel ID
- param2 -> Observation start interval (cycle)
- param3 -> Observation finish interval (cycle)
- param4 -> Sampling frequency
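As a concrete illustration, the helper below shows how the four positional parameters map onto the invocation; the specific values are made up for the example:

```python
# Hypothetical helper mapping the four positional parameters onto the
# average_disp.py command line (values below are illustrative only).
def build_command(kernel_id, start_cycle, finish_cycle, sampling_freq):
    return (f"python3 average_disp.py "
            f"{kernel_id} {start_cycle} {finish_cycle} {sampling_freq}")

# Kernel 1, observed over cycles [55500, 56000), sampled every 100 cycles:
cmd = build_command(1, 55500, 56000, 100)
# cmd == "python3 average_disp.py 1 55500 56000 100"
```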
We have experimented with the PageRank algorithm (PR) on GV100 and RTX2060S. We have also configured the GPUs of the Jetson AGX Xavier and Xavier NX and experimented on them with the Fast Fourier Transform algorithm. The profiling and visualization results are too large to upload here; however, we keep them on our local servers and can send them on request. Do not hesitate to contact us for any help or results.