Cleanup and AVX10 Support #22

Draft
wants to merge 65 commits into
base: master
0ce03c9
Add half precision
davschneller Jul 23, 2024
0716375
Merge remote-tracking branch 'origin/master' into davschneller/half
davschneller Jul 23, 2024
7dc330e
Fix half precision, add full AVX10 support
davschneller Jul 23, 2024
bfb3677
Forward half precision, refactor a bit
davschneller Jul 23, 2024
56287e7
Forward new vector lengths
davschneller Jul 23, 2024
aa29c9b
Prepare for BFLOAT16, AVX512 masks, single-precision NEON
davschneller Aug 6, 2024
9265e89
Fix build
davschneller Aug 6, 2024
0b51b2f
Autoselect bk
davschneller Aug 6, 2024
3dbc276
Fix register counting
davschneller Aug 6, 2024
3ad7718
Add extra strategy for hsw
davschneller Aug 6, 2024
8e604f2
Adjust block condition
davschneller Aug 6, 2024
c51fa7c
Fix register offset
davschneller Aug 6, 2024
7152b9c
Compact dense matrix multiplication
davschneller Aug 6, 2024
e8fb559
Bugfixing
davschneller Aug 6, 2024
1692187
Fix pointers
davschneller Aug 6, 2024
cefc13a
Bugfixes
davschneller Aug 6, 2024
3f67d54
Rollback condition extension
davschneller Aug 6, 2024
f19ea1b
Begin adjusting tests
davschneller Aug 6, 2024
23cbc2c
Prepare for better sparsity loading
davschneller Aug 15, 2024
27e934d
Unroll again
davschneller Aug 15, 2024
927e4f6
Update and extend tests
davschneller Aug 19, 2024
56f9d3d
Test cleanup
davschneller Oct 1, 2024
dae3e37
Fix test names
davschneller Oct 10, 2024
0438eef
Fix bugs in generators
davschneller Oct 10, 2024
2a64dfd
Fix codegen bugs, prepare for AVX512 mask codegen
davschneller Oct 11, 2024
17f23f9
Merge test scripts
davschneller Oct 12, 2024
1d2656a
Adjust tests a bit more, and correct instructions
davschneller Oct 18, 2024
1edd02c
Output more error infos
davschneller Oct 18, 2024
ad887eb
Generalize broadcasting interface a tiny bit
davschneller Oct 19, 2024
b98e8f0
Avoid 0-adds
davschneller Oct 19, 2024
386266b
Unique register clobbering
davschneller Oct 19, 2024
fb70f32
Update blocksize args
davschneller Oct 19, 2024
535c44e
Clean up block sizes
davschneller Oct 19, 2024
97fe82c
Clean up module finder
davschneller Oct 19, 2024
5d92250
Clean up unit tests
davschneller Oct 19, 2024
57d1b47
Prepare for larger-k ARM kernels
davschneller Oct 19, 2024
a321480
Fix ARM kernel register block loading
davschneller Oct 19, 2024
b2ceb79
Add some more comments
davschneller Oct 19, 2024
2fc28a9
Fix unit tests
davschneller Oct 19, 2024
73988b2
Rework test output
davschneller Oct 19, 2024
5b14a9d
Add missing classmethod modifiers
davschneller Oct 19, 2024
985634c
Remove old scripts references
davschneller Oct 19, 2024
9f99416
Fix more class-move bugs
davschneller Oct 19, 2024
e3a4fc2
Fix ARM_SVE codegen
davschneller Oct 19, 2024
07a487b
Yet even more class-move fixes
davschneller Oct 19, 2024
207a373
Add large matrix support and tests
davschneller Oct 19, 2024
fddef2e
Update readme
davschneller Oct 19, 2024
1dab328
Enable AVX512 mask support
davschneller Oct 19, 2024
9bc9831
Fix tests once more
davschneller Oct 19, 2024
8d4f51f
Bump version number
davschneller Oct 19, 2024
88a47fb
Fix large matrix offsetting
davschneller Oct 19, 2024
af3fbda
Fix memory load masking
davschneller Oct 19, 2024
d87a5af
Working larger B registers
davschneller Oct 19, 2024
c480249
Extend inline ARM_SVE broadcast fma
davschneller Oct 20, 2024
8ee2367
Add Cube and MaxK scripts
davschneller Oct 20, 2024
a8a1a62
Towards masking for k
davschneller Oct 20, 2024
096f190
Fix ARM_SVE blocksizes
davschneller Oct 21, 2024
e17d138
Change load order for the B matrix
davschneller Oct 21, 2024
aa95d58
Set MaxK to default
davschneller Oct 25, 2024
6193c09
Bugfixes
davschneller Oct 27, 2024
5fcf5fd
Remove old gitignores
davschneller Oct 27, 2024
13bd622
Fix k predicates
davschneller Oct 30, 2024
9c430fc
Fix vm overhead
davschneller Oct 30, 2024
885c9d4
Fix AVX512 masking
davschneller Oct 30, 2024
a0ce34a
Hotfix overlarge bm
davschneller Oct 30, 2024
30 changes: 20 additions & 10 deletions .github/workflows/codegen.yml
@@ -64,17 +64,20 @@ jobs:
       - name: pspamm-tests-generate
         run: |
           cd tests/
-          python unit_tests_hsw.py
+          python unit_test.py hsw256
+          python unit_test.py hsw128

       - name: pspamm-tests-compile
         run: |
           cd tests/
-          g++ -static -mavx512f build/hsw_testsuite.cpp -o build/hsw-test
+          g++ -static -mavx2 build/hsw256_testsuite.cpp -o build/hsw256-test
+          g++ -static -mavx2 build/hsw128_testsuite.cpp -o build/hsw128-test

       - name: pspamm-tests-run
         run: |
           cd tests/
-          qemu-x86_64-static -cpu Haswell build/hsw-test
+          qemu-x86_64-static -cpu Haswell build/hsw256-test
+          qemu-x86_64-static -cpu Haswell build/hsw128-test

   pspamm-codegen-avx512-no-run:
     name: pspamm-codegen-avx512-no-run
@@ -102,18 +105,24 @@ jobs:
       - name: pspamm-tests-generate
         run: |
           cd tests/
-          python unit_tests_knl.py
+          python unit_test.py knl512
+          python unit_test.py knl256
+          python unit_test.py knl128

       - name: pspamm-tests-compile
         run: |
           cd tests/
-          g++ -static -mavx512f build/knl_testsuite.cpp -o build/knl-test
+          g++ -static -mavx512f build/knl512_testsuite.cpp -o build/knl512-test
+          g++ -static -mavx512f build/knl256_testsuite.cpp -o build/knl256-test
+          g++ -static -mavx512f build/knl128_testsuite.cpp -o build/knl128-test

       # disabled, since qemu doesn't support AVX512F (yet) with of Ubuntu 24.04
       # - name: pspamm-tests-run
       # run: |
       # cd tests/
-      # qemu-x86_64-static -cpu Skylake-Server build/knl-test
+      # qemu-x86_64-static -cpu Skylake-Server build/knl512-test
+      # qemu-x86_64-static -cpu Skylake-Server build/knl256-test
+      # qemu-x86_64-static -cpu Skylake-Server build/knl128-test

   pspamm-codegen-aarch64:
     name: pspamm-codegen-aarch64
@@ -141,24 +150,25 @@ jobs:
       - name: pspamm-tests-generate
         run: |
           cd tests/
-          python unit_tests_arm.py
+          python unit_test.py arm128

       - name: pspamm-tests-compile
         run: |
           cd tests/
-          aarch64-linux-gnu-g++ -static -march=armv8.2-a build/arm_testsuite.cpp -o build/arm-test
+          aarch64-linux-gnu-g++ -static -march=armv8.2-a build/arm128_testsuite.cpp -o build/arm128-test

       - name: pspamm-tests-run
         run: |
           cd tests/
-          qemu-aarch64-static -cpu max build/arm-test
+          qemu-aarch64-static -cpu max build/arm128-test

   pspamm-codegen-armsve:
     name: pspamm-codegen-armsve
     runs-on: ubuntu-24.04
     needs: install-pspamm
     # include vector lengths for SVE manually (for now)
     strategy:
+      fail-fast: false
       matrix:
         vectorlen:
           - 128
@@ -188,7 +198,7 @@ jobs:
       - name: pspamm-tests-generate
         run: |
           cd tests/
-          python unit_tests_arm_sve.py ${{matrix.vectorlen}}
+          python unit_test.py arm_sve${{matrix.vectorlen}}

       - name: pspamm-tests-compile
         run: |
9 changes: 6 additions & 3 deletions README.md
@@ -1,7 +1,10 @@
-# Code Generator for Sparse Matrix Multiplication
-Generates inline-Assembly for sparse Matrix Multiplication.
+# PSpaMM
+A Code Generator For Small Sparse (and Dense) Matrix Multiplications.

-Currently Intel Xeon Phi 'Knights Landing' (AVX512), Haswell/Zen2 (AVX2), and ARM Cortex-A53 (ARMv8) are supported.
+Currently supported:
+
+* x86_64: AVX2, AVX512/AVX10.1
+* ARM/AARCH64: NEON, SVE (128,256,512,1024,2048 bit)

 ## Installation

2 changes: 1 addition & 1 deletion pspamm/VERSION
@@ -1 +1 @@
-0.2.2
+0.3.0
12 changes: 0 additions & 12 deletions pspamm/architecture.py
@@ -8,17 +8,5 @@ def init():
     generator = None
     operands = None

-
-
-#https://stackoverflow.com/questions/452969/does-python-have-an-equivalent-to-java-class-forname
-
-def get_class( kls ):
-    return import_module(kls)
-    parts = kls.split('.')
-    module = ".".join(parts[:-1])
-    m = __import__( module )
-    for comp in parts[1:]:
-        m = getattr(m, comp)
-    return m


2 changes: 0 additions & 2 deletions pspamm/codegen/analysis.py
@@ -55,5 +55,3 @@ def visitBlock(self, block: Block):
             stmt.accept(self)
         self.stack.pop()

-
-
125 changes: 125 additions & 0 deletions pspamm/codegen/architectures/arm/blocksize.py
@@ -0,0 +1,125 @@

class Old:
    @classmethod
    def getBlocksize(cls, m, n, bk, v_size, prec):

        bm = m
        bn = n

        if cls.ARM_condition(bm, bn, bk, v_size):
            while cls.ARM_condition(bm, bn, bk+1, v_size):
                bk += 1
            return (bm, bn, bk)

        while not cls.ARM_condition(bm, bn, bk, v_size):
            bm, bn = cls.lowerToNextDiv(m, n, bm, bn, v_size)

        while cls.ARM_condition(bm, bn, bk+1, v_size):
            bk += 1

        return (bm, bn, bk)

    @classmethod
    def lowerToNextDiv(cls, m, n, bm, bn, v_size):
        if bm > bn and bm > v_size:
            bm -= v_size
            while m % bm != 0:
                bm -= v_size
        else:
            bn -= 1
            while n % bn != 0:
                bn -= 1

        return bm, bn

    @classmethod
    def ARM_condition(cls, bm, bn, bk, v_size):
        # ceiling division
        vm = -(bm // -v_size)
        return (bn+bk) * vm + bn*bk <= 32

class Max:
    @classmethod
    def getBlocksize(cls, m, n, bk, v_size, prec):
        bm = 2
        bn = 1
        maxval = 0

        for i in range(v_size, m+1, v_size):
            for j in range(1, n+1):
                if cls.ARM_condition(i, j, bk, v_size):
                    if i*j > maxval:
                        maxval = i*j
                        bm = i
                        bn = j

        while cls.ARM_condition(bm, bn, bk+1, v_size):
            bk += 1

        return (bm, bn, bk)


    @classmethod
    def ARM_condition(cls, bm, bn, bk, v_size):
        # ceiling division
        vm = -(bm // -v_size)
        return (bn+bk) * vm + bn*bk <= 32

class MaxK:
    @classmethod
    def getBlocksize(cls, m, n, bk, v_size, prec):
        bm = 2
        bn = 1
        maxval = 0

        elem128 = 16 // prec.size()

        for i in range(v_size, m+1, v_size):
            for j in range(1, n+1):
                if cls.ARM_condition(i, j, bk, v_size, elem128):
                    if i*j > maxval:
                        maxval = i*j
                        bm = i
                        bn = j

        while cls.ARM_condition(bm, bn, bk+1, v_size, elem128):
            bk += 1

        return (bm, bn, bk)

    @classmethod
    def ARM_condition(cls, bm, bn, bk, v_size, elem128):
        # ceiling division
        vm = -(bm // -v_size)
        vk = -(bk // -elem128)
        return (bn+bk) * vm + bn*vk <= 32

class Cube:
    @classmethod
    def getBlocksize(cls, m, n, bk, v_size, prec):
        bm = 2
        bn = 1
        maxval = 0

        elem128 = 16 // prec.size()

        for i in range(v_size, m+1, v_size):
            for j in range(1, n+1):
                for k in range(1, 200):
                    if cls.ARM_condition(i, j, k, v_size, elem128):
                        if i*j*k > maxval:
                            maxval = i*j*k
                            bm = i
                            bn = j
                            bk = k

        return (bm, bn, bk)

    @classmethod
    def ARM_condition(cls, bm, bn, bk, v_size, elem128):
        # ceiling division
        vm = -(bm // -v_size)
        vk = -(bk // -elem128)
        return (bn+bk) * vm + bn*vk <= 32

Default = MaxK
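
The register-budget check used by `MaxK` can be tried out on its own. The following is a minimal sketch and not part of this PR: the helper names `ceil_div`, `fits_registers`, and `max_k_blocksize` are hypothetical, `elem128` is passed in directly instead of being derived from a precision object, and reading the three terms of the formula as registers for the C, A, and B blocks is an assumption based on the expression itself.

```python
def ceil_div(a, b):
    # Same trick as in the diff: negate, floor-divide, negate again.
    return -(a // -b)


def fits_registers(bm, bn, bk, v_size, elem128, budget=32):
    # vm: vector registers needed to hold bm rows (v_size elements each).
    # vk: 128-bit chunks needed to hold bk values (elem128 elements each).
    vm = ceil_div(bm, v_size)
    vk = ceil_div(bk, elem128)
    # Assumed reading: bn*vm for the C block, bk*vm for A, bn*vk for B.
    return (bn + bk) * vm + bn * vk <= budget


def max_k_blocksize(m, n, bk, v_size, elem128):
    # Mirrors the MaxK search: maximize bm*bn first, then grow bk.
    bm, bn, maxval = 2, 1, 0
    for i in range(v_size, m + 1, v_size):
        for j in range(1, n + 1):
            if fits_registers(i, j, bk, v_size, elem128) and i * j > maxval:
                maxval, bm, bn = i * j, i, j
    while fits_registers(bm, bn, bk + 1, v_size, elem128):
        bk += 1
    return bm, bn, bk


# Example: 8x8 block, two elements per 128-bit vector (e.g. doubles).
print(max_k_blocksize(8, 8, 1, 2, 2))  # (6, 7, 1)
```

In this example the 6x7 block uses 31 of the 32 registers, so `bk` cannot grow past 1; the `Cube` variant would instead trade some of `bm*bn` for a larger `bk`.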