README: GeMM Simulation Framework

Overview

This project simulates a General Matrix Multiply (GeMM) operation using a custom-designed CPU, memory hierarchy, and instruction set architecture (ISA). The framework includes support for DRAM, a GemmCache for optimizing matrix operations, and a programmable CPU. It facilitates GeMM operations and provides a fully programmable environment for running custom programs that involve memory and arithmetic operations.

The framework is implemented in Python and is designed for educational and experimental purposes, enabling users to explore hardware design principles and low-level programming concepts.

Usage Guide

Setting Up

Dependencies: Ensure Python 3.7+ is installed, along with numpy for matrix operations.

   pip3 install numpy

Directory Structure

   ├── baseline.py
   ├── cache_test.py
   ├── cpu.py
   ├── dram.py
   ├── gemm_cache.py
   ├── gemm_cache_test.py
   ├── large_matrix.py
   ├── matrix.py
   ├── memory.py
   ├── memory_array.py
   ├── packet.py
   ├── program.py
   └── simple.py

Run Examples: Choose a script based on your use case. For example:

   python3 simple.py

- simple.py: runs a simple program that demonstrates basic CPU operations with a standard cache.
- baseline.py: initializes matrices and validates GeMM operations with a standard cache.
- matrix.py: tests operations on small matrices with GemmCache.
- large_matrix.py: tests operations on large matrices with GemmCache.
- gemm_cache_test.py and cache_test.py: unit tests for specific components.

Key Components

1. DRAM

Class: Dram
Purpose: Simulates main memory, allowing storage and retrieval of data with configurable latencies and burst sizes.
Features:
- Load and store operations.
- Configurable memory size, read latency, write latency, and burst size.
- Out-of-bounds access handling.
- Memory dump functionality for debugging.
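
The features above can be illustrated with a minimal sketch. ToyDram is a hypothetical stand-in written for this example and is not the repository's Dram class; in particular, charging one latency period per burst is an assumption about the timing model, not something the source confirms.

```python
class ToyDram:
    """Hypothetical DRAM model for illustration; not the repository's Dram class."""

    def __init__(self, size, read_latency, write_latency, burst_size):
        self.data = bytearray(size)
        self.read_latency = read_latency
        self.write_latency = write_latency
        self.burst_size = burst_size

    def _check(self, addr, length):
        # Out-of-bounds accesses are rejected instead of wrapping around.
        if addr < 0 or length < 0 or addr + length > len(self.data):
            raise IndexError(f"access [{addr}, {addr + length}) is out of bounds")

    def _bursts(self, length):
        # Ceiling division: a partial burst still costs a full one.
        return -(-length // self.burst_size)

    def load(self, addr, length):
        """Return (bytes read, cycles spent)."""
        self._check(addr, length)
        cycles = self.read_latency * self._bursts(length)  # assumed timing model
        return bytes(self.data[addr:addr + length]), cycles

    def store(self, addr, payload):
        """Write payload at addr and return the cycles spent."""
        self._check(addr, len(payload))
        self.data[addr:addr + len(payload)] = payload
        return self.write_latency * self._bursts(len(payload))


dram = ToyDram(size=64, read_latency=10, write_latency=12, burst_size=8)
write_cycles = dram.store(0, bytes(range(16)))  # 16 bytes = 2 bursts
value, read_cycles = dram.load(4, 8)            # 8 bytes = 1 burst
```

Returning the cycle count alongside the data keeps the timing visible to the caller, which is one simple way a CPU model could accumulate total execution time.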

2. GemmCache

Class: GemmCache
Purpose: Simulates a cache specifically designed for matrix operations like multiplication and addition.
Features:
- Optimized for matrix operations (matrix_multiply, matrix_add).
- Configurable dimensions, number of matrices, and operation latencies.
- Directly interfaces with DRAM.
- Handles both scalar and matrix data through Packet and MatrixPacket objects.

3. Cache

Class: Cache
Purpose: General-purpose cache for scalar memory operations.
Features:
- Implements a simple block-based cache with eviction logic.
- Supports both read and write operations.
- Interfaces with DRAM for cache misses.
- Eviction logic keeps cached data consistent with DRAM.
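
As a rough illustration of a block-based cache with eviction, the sketch below uses least-recently-used replacement and writes dirty blocks back on eviction. ToyCache, its method names, and the LRU and write-back policies are all assumptions made for this example; the repository's Cache class may use a different policy.

```python
from collections import OrderedDict


class ToyCache:
    """Hypothetical block cache for illustration; not the repository's Cache class."""

    def __init__(self, backing, block_size, num_blocks):
        self.backing = backing        # bytearray standing in for DRAM
        self.block_size = block_size
        self.num_blocks = num_blocks
        self.blocks = OrderedDict()   # block number -> [data, dirty flag]
        self.hits = 0
        self.misses = 0

    def _block(self, addr):
        number = addr // self.block_size
        if number in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(number)  # mark as most recently used
        else:
            self.misses += 1
            if len(self.blocks) >= self.num_blocks:
                self._evict()
            start = number * self.block_size
            data = bytearray(self.backing[start:start + self.block_size])
            self.blocks[number] = [data, False]
        return self.blocks[number]

    def _evict(self):
        # Drop the least recently used block, writing it back if modified.
        number, (data, dirty) = self.blocks.popitem(last=False)
        if dirty:
            start = number * self.block_size
            self.backing[start:start + self.block_size] = data

    def read(self, addr):
        return self._block(addr)[0][addr % self.block_size]

    def write(self, addr, value):
        entry = self._block(addr)
        entry[0][addr % self.block_size] = value
        entry[1] = True  # block now differs from the backing store


memory = bytearray(32)
cache = ToyCache(memory, block_size=4, num_blocks=2)
cache.write(0, 7)  # miss: block 0 is loaded and marked dirty
cache.read(1)      # hit: same block
cache.read(4)      # miss: block 1 is loaded
cache.read(8)      # miss: block 0 is evicted and written back
```

After the last read, the value written to address 0 is visible in the backing bytearray because the dirty block was flushed during eviction.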

4. CPU

Class: Cpu
Purpose: Simulates a programmable CPU capable of running a custom instruction set.
Features:
- Configurable number of registers and register size.
- Direct interface with memory objects (e.g., DRAM, GemmCache).
- Support for custom instructions, including arithmetic, bitwise, memory, and matrix operations.
- Tracks program execution time (cycles).

5. Program

Class: Program
Purpose: Encapsulates instructions to be executed by the CPU.
Features:
- Customizable instruction set.
- Labels and branching for flow control.
- Support for memory, arithmetic, and matrix operations.
- Converts instructions into a list for execution by the CPU.

6. MemoryArray

Class: MemoryArray
Purpose: Implements a contiguous byte-addressable memory for use in DRAM, caches, and matrix storage.
Features:
- Supports load, store, and delete operations.
- Handles both scalar and multi-byte data.

7. Packets

Classes: Packet, MatrixPacket
Purpose: Encapsulate memory and matrix operation requests.
Features:
- Scalar memory access (Packet).
- Matrix operation requests (MatrixPacket).

Instructions Overview

Special Instructions

Instruction	Purpose	Example
`matrix_multiply`	Perform matrix multiplication in cache	`matrix_multiply(1, 2, 3)`
`matrix_add`	Perform matrix addition in cache	`matrix_add(1, 2, 3)`
`halt`	Stop program execution	`halt()`

Memory Instructions

Instruction	Purpose	Example
`move_memory`	Copy memory block to another address	`move_memory(1, 2, 3)`
`load`	Load data from memory to register	`load(1, 0, 100)`
`store`	Store register data to memory	`store(1, 0, 100)`
`load_byte`	Load byte from memory to register	`load_byte(1, 0, 100)`
`store_byte`	Store the lowest byte of a register to memory	`store_byte(1, 0, 100)`

ALU Instructions

Instruction	Purpose	Example
`move`	Move immediate value to register	`move(1, 42)`
`add`	Add two registers and store result	`add(3, 1, 2)`
`add_immediate`	Add immediate to register value	`add_immediate(2, 1, 256)`
`multiply`	Multiply two registers and store result	`multiply(3, 1, 2)`
`bitwise_not`	NOT register and store result	`bitwise_not(2, 1)`
`bitwise_or`	OR two registers and store result	`bitwise_or(3, 1, 2)`
`bitwise_and`	AND two registers and store result	`bitwise_and(3, 1, 2)`
`bitwise_xor`	XOR two registers and store result	`bitwise_xor(3, 1, 2)`
`logical_shift_left`	Shift register to left and store result	`logical_shift_left(3, 1, 2)`
`logical_shift_right`	Shift register to right and store result	`logical_shift_right(3, 1, 2)`

Branch Instructions

Instruction	Purpose	Example
`insert_label`	Macro for inserting label for branches/jumps	`insert_label("loop")`
`label_branch_if`	Macro for branching to label	`label_branch_if(1, "loop")`
`label_jump`	Macro for jumping to label	`label_jump(1, "loop")`
`branch_if`	If register is nonzero, branch to PC plus immediate	`branch_if(1, -10)`
`jump`	Jump to register value	`jump(1)`
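
The label macros in the table above imply some form of label resolution. One common approach is a two-pass scheme, sketched below. The tuple encoding of instructions, the resolve_labels helper, and the choice to measure offsets from the branch instruction's own index are all assumptions made for illustration; the source does not describe how Program resolves labels.

```python
def resolve_labels(instructions):
    """Replace label-based branches with PC-relative branch_if instructions.

    Instructions are hypothetical tuples such as ("move", 1, 3) or
    ("label", "loop"); this encoding is not the repository's format.
    """
    labels = {}
    body = []
    # Pass 1: record the index each label points to; labels occupy no slot.
    for instr in instructions:
        if instr[0] == "label":
            labels[instr[1]] = len(body)
        else:
            body.append(instr)
    # Pass 2: rewrite label names as offsets relative to the branch itself.
    resolved = []
    for pc, instr in enumerate(body):
        if instr[0] == "label_branch_if":
            _, register, name = instr
            resolved.append(("branch_if", register, labels[name] - pc))
        else:
            resolved.append(instr)
    return resolved


source = [
    ("move", 1, 3),
    ("label", "loop"),
    ("add_immediate", 1, 1, -1),
    ("label_branch_if", 1, "loop"),
    ("halt",),
]
resolved = resolve_labels(source)
```

Here the branch sits at index 2 and the label points at index 1, so the macro becomes a branch_if with an offset of -1.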

Sample Workflow: Matrix Multiply and Add

Initialize DRAM: Define matrices A, B, and D in any example script (e.g., baseline.py).

import numpy as np

# 4x4 matrices of random 0/1 entries, stored as 8-bit integers
mat_A = np.random.randint(0, 2, size=(4, 4), dtype=np.int8)
mat_B = np.random.randint(0, 2, size=(4, 4), dtype=np.int8)
mat_D = np.random.randint(0, 2, size=(4, 4), dtype=np.int8)

Create a Program: Define operations that do the following:

- Load matrices into GemmCache.
- Perform matrix multiplication and addition.
- Store the result back into DRAM.

program = Program(REGISTER_BYTES)
program.matrix_multiply(1, 2, 3)
program.matrix_add(1, 2, 1)
program.halt()

Run the Program: Execute the program using the CPU.

cpu.run_program(program)

Validate Results: Compare the computed result with the expected output.

assert np.array_equal(mat_C, np.array(matrix_C))
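
The expected output for this workflow can be computed independently with NumPy. The sketch below assumes the workflow computes A x B + D, as the section title suggests. The naive_gemm helper is a hypothetical stand-in for the values read back from the simulator, and the names expected and simulated are chosen for this example only.

```python
import numpy as np

# Seeded generator so the example is reproducible; the seed is arbitrary.
rng = np.random.default_rng(0)
mat_A = rng.integers(0, 2, size=(4, 4), dtype=np.int8)
mat_B = rng.integers(0, 2, size=(4, 4), dtype=np.int8)
mat_D = rng.integers(0, 2, size=(4, 4), dtype=np.int8)

# Reference result from NumPy, widened to int32 so larger inputs cannot overflow int8.
expected = mat_A.astype(np.int32) @ mat_B.astype(np.int32) + mat_D.astype(np.int32)


def naive_gemm(a, b, d):
    """Plain-Python A x B + D on nested lists (stand-in for simulator output)."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) + d[i][j] for j in range(m)]
            for i in range(n)]


simulated = naive_gemm(mat_A.tolist(), mat_B.tolist(), mat_D.tolist())
assert np.array_equal(expected, np.array(simulated))
```

Comparing against a second, independently written implementation is a cheap way to catch indexing mistakes before blaming the simulated hardware.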

Debugging Tools

print_registers(): Prints the current state of CPU registers.
print_memory(): Dumps the state of memory for debugging.
Matrix Dumps: Verify the state of matrices in DRAM and GemmCache.

Extending the Framework

Add Instructions: Define new instructions in Cpu and add their logic in Program.
Optimize GemmCache: Modify GemmCache for additional optimizations like tiling or multi-threading.
Integrate More Complex Workloads: Extend or create new example scripts for more sophisticated matrix operations.

Known Limitations

Limited support for variable-sized cache blocks.
Designed for single-threaded simulation only.

System Specifications and Heuristics

1. Area Heuristics

Components Considered:
- Number of adders.
- Number of multipliers.
- Number of flip-flops.
- Cache area (from CACTI).
- Scratchpad area (from CACTI).

2. GemmCache

Area Heuristics:
- Each entry of the output matrix uses:
  - 1 processing element, consisting of:
    - 1 8-bit multiplier.
    - 1 8-bit adder.
    - 8 flip-flops.
Matrix Multiply Latency:
- 3d - 1 cycles, where d is the height of the square matrix.
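
The heuristics above reduce to simple arithmetic. The sketch below works them through for a d x d output matrix; the function name gemm_cache_estimate and the dictionary keys are hypothetical and chosen for this example.

```python
def gemm_cache_estimate(d):
    """Resource counts and multiply latency for a d x d output matrix."""
    if d < 1:
        raise ValueError("matrix dimension must be at least 1")
    processing_elements = d * d  # one processing element per output entry
    return {
        "processing_elements": processing_elements,
        "multipliers_8bit": processing_elements,   # 1 multiplier per element
        "adders_8bit": processing_elements,        # 1 adder per element
        "flip_flops": 8 * processing_elements,     # 8 flip-flops per element
        "multiply_latency_cycles": 3 * d - 1,
    }


estimate = gemm_cache_estimate(4)
```

For d = 4 this gives 16 processing elements, 128 flip-flops, and a multiply latency of 11 cycles.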

Acknowledgments

This project is a simplified educational tool for understanding low-level CPU and memory interactions, inspired by hardware simulation and system design principles.

Repository: jmgrode/gemm_cache