Implement finite field ccopy, neg, cneg, nsqr, ... for CUDA target #466

Merged: 85 commits merged into master from the moreFFnvidia branch on Oct 27, 2024

Conversation

Vindaar (Collaborator) commented Sep 10, 2024:

This extends the existing add, mul, sub operations on finite fields for the Nvidia target with ccopy, neg, cneg, nsqr. For nsqr we have two different implementations for now (a sketch of the intended semantics follows after the list):

  1. generate N squaring operations based on a Nim runtime / JIT compile time value
  2. generate a loop on the GPU using a phi node and branches for JIT runtime values
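A minimal sketch of the intended nsqr semantics, written against a generic field element type FF with a two-argument square (an assumption mirroring the CPU API quoted later in this thread); it illustrates what the kernel computes, not the JIT code path itself:

  proc nsqrRef[FF](r: var FF, a: FF, n: int) =
    ## n repeated squarings; variant 1 unrolls this loop at JIT compile time,
    ## variant 2 emits it as a PTX loop built from a phi node and branches.
    r = a
    for _ in 0 ..< n:
      r.square(r)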

I added a simple test case for each operation, which compares with the equivalent operation on the CPU. Currently I just cloned the test file for each case. It'd probably be wise to unify them / abstract away some of the boilerplate.

Update 2024/09/19

I have now added the following more finite field operations:

  • setZero
  • cadd
  • csub
  • double
  • isZero
  • isOdd
  • shiftRight
  • div2

In addition, I split all the implementations for each finite field op into an internal and a public part. That way we can reuse the internal operations when building other ops.

Further, I added distinct Array types to deal with finite field points and elliptic curve points. With those, I then added a 'test case' of sorts, where I ported the EC short Weierstrass Jacobian sumImpl logic line by line. With the help of some templates, the implementation is now essentially the same as for the CPU, and the code produces the same result.

Next I'll clean up that test case and add the EC point addition to the pub_curves.nim module.

Update 2024/09/23

I've now added the EC addition to a new pub_curves.nim module. In addition, I added a helper type NvidiaAssembler to simplify the Nvidia setup & compilation process. It can now be done with 2 lines:

  let nv = initNvAsm(EC_ShortW_Jac[field, G1], wordSize)  # either takes an EC_ShortW_Jac type or a Fp / Fr type
  let kernel = nv.compile(genEcSum) # pass the proc defining an `llvmPublicFnDef` 

(alternatively one can use the Assembler_LLVM part of the NvidiaAssembler object to build instructions / call a function and then just pass the name of the kernel to compile)

Using this, the boilerplate in the tests is reduced massively.

Note: The t_ec_sum_port.nim file is the one I used to write the port. I think it can be useful to give ideas on how to port some of our existing CPU logic to the LLVM target.

Next: I'll add other EC operations.

Update 2024/09/26

Over the last few days I have:

  • split the EcPoint type into EcPointAff, EcPointJac and EcPointProj, which each have their own set of internal / public procedures
  • added more misc. field and EC operations
  • added templates fieldOps, ellipticOps, ellipticAffOps that inject multiple templates for operations, to allow writing code equivalent to the CPU code
  • ported mixedSum for Jacobian + Affine EC points. Thanks to the templates, all that was needed to port the code was to replace variable declarations and `=` assignments by `var X = asy.newEcPointJac(ed)` and `store(X, Y)` (see the sketch after this list). Aside from a minor issue due to an isNeutral name collision (internal procs still take ValueRef arguments and hence overload resolution is an issue; I have since renamed the affine version), it worked immediately.
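A hypothetical before/after sketch of that porting pattern; the GPU-side helpers newEcPointJac and store and the ellipticOps templates come from the description above, while the exact bodies and the `sum` call are illustrative rather than a verbatim excerpt from the PR:

  # CPU version
  var t {.noInit.}: EC_ShortW_Jac[field, G1]
  t.sum(P, Q)
  R = t

  # GPU/LLVM port inside a kernel body, with the ellipticOps templates in scope
  var t = asy.newEcPointJac(ed)   # declaration becomes an explicit allocation
  t.sum(P, Q)                     # same call shape, dispatched to the LLVM impl
  store(R, t)                     # `=` assignment becomes an explicit store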

Vindaar changed the title from "Implement finite field ccopy, neg, cneg, nsqr for CUDA target" to "Implement finite field ccopy, neg, cneg, nsqr, ... for CUDA target" on Sep 19, 2024
mratsim (Owner) left a comment:

Thank you. This in general looks good to me, especially the template that allows the GPU impl to not be overly verbose with the assembler context(s).

The _internal tag is a bit annoying though, as it will be everywhere when debugging, optimizing or calling those functions, while the pub functions only matter when calling the library, and even then they are only generated and called in tests through their ctt_ prefix. So I would move the internal functions into impl, maybe a new impl_fields.nim that just does the dispatch/wrapping, and make sure that when reexporting pub_fields.nim users can't access the internal functions.

There could be a macro that avoids the repetition of making an impl function public (a hypothetical sketch follows below).
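A hypothetical sketch of such a macro; every name here, including the FieldDescriptor parameter list, is illustrative rather than the PR's actual API:

  import std/[macros, strutils]

  macro genPublic(internalOp: untyped): untyped =
    ## Given an internal op such as `fpAdd_internal`, emit a thin public
    ## wrapper `fpAdd` that only forwards to it.
    let pubName = ident(($internalOp).replace("_internal", ""))
    result = quote do:
      proc `pubName`*(asy: Assembler_LLVM, fd: FieldDescriptor, r, a, b: ValueRef) =
        `internalOp`(asy, fd, r, a, b)

  genPublic(fpAdd_internal)   # would generate `proc fpAdd*(...)`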

proc determineDevicePtrs(r, i: NimNode, iTypes: seq[NimNode]): seq[(NimNode, NimNode)] =
  ## Returns the device pointer ident and its associated original symbol.
  for el in r:
    #if not el.requiresCopy:
mratsim (Owner):

Is that expected to be commented out?

Vindaar (Collaborator, Author):

I had a case where I triggered the error when I didn't expect to. I wanted to investigate at a later time, forgot, and now I can't reproduce it. Will have another look and otherwise put it back in.

Vindaar (Collaborator, Author):

Remembered the use case where it caused issues. I've now added test cases for execCuda and changed the logic to an allowedCopy check for this. I.e. if one passes a var symbol as a result parameter, it is considered allowed (despite potentially not being a ptr T type). The macro logic now checks whether it needs to apply an addr to the argument or not.
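A usage sketch of the behaviour described above; the exact execCuda parameter names and call shape are assumptions for illustration, not the verified API:

  var rGPU: Fp[BN254_Snarks]        # a plain `var` symbol as the result parameter
  kernel.execCuda(res = rGPU,       # allowed despite not being a `ptr T`;
                  inputs = (a, b))  # the macro applies `addr` where needed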

proc execCudaImpl(jitFn, res, inputs: NimNode): NimNode =
  # XXX: we could check for inputs that are literals and produce local variables so
  # that we can pass them by reference
mratsim (Owner):

literals or staticTy of trivial types.
It's a similar problem to this

proc needTempStorage(argTy: NimNode): bool =
  case argTy.kind
  of nnkVarTy:
    error("It is unsafe to capture a `var` parameter '" & repr(argTy) & "' and pass it to another thread. Its memory location could be invalidated if the spawning proc returns before the worker thread finishes.")
  of nnkStaticTy:
    return false
  of nnkBracketExpr:
    if argTy[0].typeKind == ntyTypeDesc:
      return false
    else:
      return true
  of nnkCharLit..nnkNilLit:
    return false
  else:
    return true
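A minimal sketch of the idea in the XXX comment above: a literal (or trivial static) argument cannot be taken by address directly, so the generated code could hoist it into a local first. The emitted shape is an assumption:

  # what the macro could emit for a literal input such as 5'u64
  var arg0 = 5'u64          # literal hoisted into a local variable
  let p0 = addr arg0        # now addressable, can go into the kernel parameter array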

asy.modadd(fd, ri, ai, bi, M)
asy.br.retVoid()

asy.callFn(name, [r, a, b])
mratsim (Owner):

Ah, I see, there are 3 types of functions:

  • underlying primitives
  • internal
  • public

and for fields, internal is just forwarding to modular arithmetic primitives

In the past I created an enum with the types of functions:

Opcode* = enum
  opFpAdd = "fp_add"
  opFrAdd = "fr_add"
  opFpSub = "fp_sub"
  opFrSub = "fr_sub"
  opFpMul = "fp_mul"
  opFrMul = "fr_mul"
  opFpMulSkipFinalSub = "fp_mul_skip_final_sub"
  opFrMulSkipFinalSub = "fr_mul_skip_final_sub"

and I was planning to associate (opcode, curve) -> primitive, to call the proper underlying implementation (a sketch of such a dispatch follows at the end of this comment).

I have yet to find the best solution, but I think removing the _internal noise would be nice to improve debugging, perf optimization and the dev experience. Having all those _internal is annoying, and unlike _vartime there is no security risk to justify it.

I think they should all be moved to the impl_*.nim file and renamed without _internal.

Public functions won't collide in LLVM due to the ctt_ prefix, and the Nim ones won't collide because the internal functions should not be reexported, which is the case if they are implemented in the pub_*.nim files like now.
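A hypothetical sketch of the (opcode, curve) -> primitive association mentioned above; the table layout and the generator signature are assumptions for illustration:

  import std/tables

  type PrimitiveGen = proc (asy: Assembler_LLVM, fd: FieldDescriptor,
                            r, a, b: ValueRef) {.nimcall.}

  var primitives: Table[(Opcode, string), PrimitiveGen]

  proc register(op: Opcode, curve: string, gen: PrimitiveGen) =
    primitives[(op, curve)] = gen

  proc dispatch(asy: Assembler_LLVM, fd: FieldDescriptor, op: Opcode,
                curve: string, r, a, b: ValueRef) =
    # look up and call the proper underlying modular-arithmetic primitive
    primitives[(op, curve)](asy, fd, r, a, b)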

Comment on lines +279 to +278
## XXX: Similarly to `prod` below, `skipFinalSub = true` here, also breaks the
## code at a different point.
Z2Z2.square(Q.z, skipFinalSub = false)
## XXX: If we set this `skipFinalSub` to true, the code will fail to produce the
## correct result in some cases. Not sure yet why. See `tests/gpu/t_ec_sum.nim`.
S1.prod(Q.z, Z2Z2, skipFinalSub = false)
Vindaar (Collaborator, Author):

Either of these causes problems in the code after multiple iterations in the mentioned test case. The two below (Z1Z1 and S2) work fine though. Maybe we want to avoid it everywhere for the time being, out of an abundance of caution.

Comment on lines +537 to +534
## XXX: either of these `skipFinalSub = true` breaks the code at some point.
## See also `tests/gpu/t_ec_sum.nim`.
Z1Z1.square(P.z, skipFinalSub = false) #not (cd.coef_a == -3))
S2.prod(P.z, Z1Z1, skipFinalSub = false)
Vindaar (Collaborator, Author):

Same as above, in the non-mixed sum implementation.

# Test utilities
helpers/prng_unsafe

proc genSumImpl*(asy: Assembler_LLVM, ed: CurveDescriptor): string =
mratsim (Owner):

Is this a debug file?

mratsim (Owner):

For now I've deleted it.

Vindaar (Collaborator, Author):

Yes, of sorts. It's the file in which I ported over the CPU version. I wanted to commit it so that it wouldn't be lost (in the PR branch at least!). It can be useful both for other people trying to get an idea of how not to get overwhelmed porting a more complex algorithm, or simply for myself in the future.

let a = Fp[BN254_Snarks].fromUInt(1'u64)
let b = Fp[BN254_Snarks].fromHex("0x2beb0d0d6115007676f30bcc462fe814bf81198848f139621a3e9fa454fe8e6a")

testName(Fp[BN254_Snarks], 64, a, b)
mratsim (Owner):

This test fails for me, I will have a look

Using CUDA Device [0]: NVIDIA GeForce RTX 3080 Laptop GPU
Compute Capability: SM 8.6
CPU = 0x2beb0d0d6115007676f30bcc462fe814bf81198848f139621a3e9fa454fe8e6a
GPU = 0x02ea4211d4fe77aa8e6f314bf450574e3d16790b8b20238756c6cf8d1a9ed9e8
fatal.nim(53)            sysFatal
Error: unhandled exception: t_mul.nim(35, 5) `bool(rCPU == rGPU)`  [AssertionDefect]

I assume it's failing due to an Nvidia bug with 64-bit mul: see #348 (comment) / https://forums.developer.nvidia.com/t/incorrect-result-of-ptx-code/221067

mratsim (Owner):

No issue on 32-bit, so same bug

  • E.g. when calling `addIncoming` for φ nodes, one needs to pass the current function.
  • `nsqr` exists both as a 'compile time' and a 'runtime' version, where compile time here refers to the JIT compilation of the CUDA code.
  • And a very basic load / store example (useful to understand how data is passed to the GPU depending on type).
  • That extra special `r` out of place argument was a bit useless.
  • Because we might not always want to load directly, but rather keep the pointer to some arbitrary (nested) array.
  • This allows for a bit more sanity wrt differentiating between field points and EC points.
  • We will need 'prime plus 1 div 2' later to implement `div2` for finite field points. The code for the logic is directly ported from the `precompute.nim` logic. Ideally we could avoid the duplication of logic, but well.

Vindaar and others added 27 commits October 27, 2024 16:31

  • We could consider making all the `_internal` procedures take `EcPoint` etc. types. That way overload resolution would not be a problem and we'd avoid more nasty bugs.
  • No need for compile time values, and this allows us e.g. to use it for curve coefficients stored in the CurveDescriptor object.
  • We cannot use it in all the same places as for the CPU implementation. There is some kind of hidden bug here, related to remaining state. See the notes in the code, but essentially, if we run the `tests/gpu/t_ec_sum.nim` test case, after a certain number of iterations of

      let a, b be EC points
      res = a
      while true:
        res = ec sum(res, b)

    the code will fail to match the expected CPU result for the same operation. Maybe related to the additional bits available for the big ints storing some problematic data?
  • i.e. so that we can write `x.addr` as an argument. We pre- and suffix it with backticks.
mratsim merged commit 7c3d76d into master on Oct 27, 2024
24 checks passed
mratsim deleted the moreFFnvidia branch on October 27, 2024 19:24