Implement finite field `ccopy`, `neg`, `cneg`, `nsqr`, ... for CUDA target #466
Conversation
Thank you. This in general looks good to me, especially the template that allows the GPU impl to not be overly verbose with the assembler context(s).

The `_internal` tag is a bit annoying though, as it will be everywhere when debugging, optimizing or calling those functions, while `pub` functions are only used when calling the library, and even then they are only generated and called in tests through their `ctt_` prefix. So I would move the internal functions into `impl`, maybe a new `impl_fields.nim`, that just does the dispatch/wrapping, and make sure that when reexporting `pub_fields.nim` users can't access internal functions.

There can be a macro that avoids the repetition of making an impl function public.
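As a rough sketch of such a macro (the names and the fixed two-argument signature are purely illustrative, not this PR's actual code; a real version would copy the internal proc's own parameter list):

```nim
import std/[macros, strutils]

# Illustrative only: given an internal proc such as `negInternal`, generate a
# public `neg*` that simply forwards to it. The `(r, a)` signature is an
# assumption for the sketch.
macro exportImpl(fn: untyped): untyped =
  let pubName = ident(fn.strVal.replace("Internal", ""))
  result = quote do:
    proc `pubName`*(r, a: auto) =
      `fn`(r, a)
```

Usage would then be a single `exportImpl(negInternal)` per operation in the public module, instead of writing each forwarding proc by hand.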
```nim
proc determineDevicePtrs(r, i: NimNode, iTypes: seq[NimNode]): seq[(NimNode, NimNode)] =
  ## Returns the device pointer ident and its associated original symbol.
  for el in r:
    #if not el.requiresCopy:
```
Is that expected to be commented out?
I had some case where I triggered the error when I didn't expect to. Wanted to investigate at a later time and forgot and now I can't reproduce it. Will have another look and otherwise put it back in.
Remembered the use case where it caused issues. I've now added test cases for `execCuda` and changed the logic to an `allowedCopy` check for this. I.e. if one passes a `var` symbol as a result parameter, it is considered allowed (despite potentially not being a `ptr T` type). The macro logic now checks if it needs to apply an `addr` to the argument or not.
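The `addr` decision described here could look roughly like this (a sketch only; `maybeAddr` is a hypothetical helper name, not the macro's actual code):

```nim
import std/macros

# Sketch: decide whether an argument handed to the CUDA execution macro needs
# an explicit `addr`. If it is already a `ptr T` it is passed through
# unchanged; a plain `var` symbol gets its address taken so the kernel
# receives a pointer it can write through.
proc maybeAddr(arg: NimNode): NimNode =
  if arg.getTypeInst.kind == nnkPtrTy:
    result = arg
  else:
    result = newCall(ident"addr", arg)
```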
```nim
proc execCudaImpl(jitFn, res, inputs: NimNode): NimNode =
  # XXX: we could check for inputs that are literals and produce local variables so
  # that we can pass them by reference
```
Literals, or `staticTy` of trivial types. It's a similar problem to this:
constantine/constantine/threadpool/parallel_offloading.nim
Lines 38 to 52 in 4a2dd28
```nim
proc needTempStorage(argTy: NimNode): bool =
  case argTy.kind
  of nnkVarTy:
    error("It is unsafe to capture a `var` parameter '" & repr(argTy) & "' and pass it to another thread. Its memory location could be invalidated if the spawning proc returns before the worker thread finishes.")
  of nnkStaticTy:
    return false
  of nnkBracketExpr:
    if argTy[0].typeKind == ntyTypeDesc:
      return false
    else:
      return true
  of nnkCharLit..nnkNilLit:
    return false
  else:
    return true
```
```nim
asy.modadd(fd, ri, ai, bi, M)
asy.br.retVoid()

asy.callFn(name, [r, a, b])
```
Ah, I see, there are 3 types of functions:
- underlying primitives
- internal
- public

and for fields, `internal` is just forwarding to the modular arithmetic primitives.

In the past I created an enum with the types of functions:
constantine/constantine/math_compiler/ir.nim
Lines 202 to 210 in cfe077b
```nim
Opcode* = enum
  opFpAdd = "fp_add"
  opFrAdd = "fr_add"
  opFpSub = "fp_sub"
  opFrSub = "fr_sub"
  opFpMul = "fp_mul"
  opFrMul = "fr_mul"
  opFpMulSkipFinalSub = "fp_mul_skip_final_sub"
  opFrMulSkipFinalSub = "fr_mul_skip_final_sub"
```
and I was planning to associate (opcode, curve) -> primitive, to call the proper underlying implementation.

I have yet to find the best solution, but I think removing the `_internal` noise would be nice to improve debugging, perf optimization and the dev experience. Having all those `_internal`s is annoying, and unlike `_vartime` there is no security risk to justify them. I think they should all be moved to the `impl_*.nim` files and renamed without `_internal`.

Public functions won't collide in LLVM due to the `ctt_` prefix, and the Nim ones won't collide because the internal functions should not be reexported, which is the case if they are implemented in the `pub_*.nim` like now.
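The planned (opcode, curve) -> primitive association could be sketched as a simple lookup table (the `Opcode` values mirror the enum quoted above; `Curve`, `Primitive` and the table itself are hypothetical stand-ins):

```nim
import std/tables

type
  Opcode = enum
    opFpAdd = "fp_add"
    opFrAdd = "fr_add"
    opFpMul = "fp_mul"
  Curve = enum
    BN254_Snarks, BLS12_381
  # Stand-in for whatever builds / returns the underlying kernel primitive.
  Primitive = proc () {.nimcall.}

var primitives: Table[(Opcode, Curve), Primitive]

proc register(op: Opcode, c: Curve, p: Primitive) =
  primitives[(op, c)] = p

proc getPrimitive(op: Opcode, c: Curve): Primitive =
  ## Look up the proper underlying implementation for this op on this curve.
  primitives[(op, c)]
```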
```nim
## XXX: Similarly to `prod` below, `skipFinalSub = true` here, also breaks the
## code at a different point.
Z2Z2.square(Q.z, skipFinalSub = false)
## XXX: If we set this `skipFinalSub` to true, the code will fail to produce the
## correct result in some cases. Not sure yet why. See `tests/gpu/t_ec_sum.nim`.
S1.prod(Q.z, Z2Z2, skipFinalSub = false)
```
Either of these causes problems in the code after multiple iterations in the mentioned test case. The two below (`Z1Z1` and `S2`) work fine though. Maybe we want to avoid it everywhere for the time being, out of an abundance of caution.
```nim
## XXX: either of these `skipFinalSub = true` breaks the code at some point.
## See also `tests/gpu/t_ec_sum.nim`.
Z1Z1.square(P.z, skipFinalSub = false) #not (cd.coef_a == -3))
S2.prod(P.z, Z1Z1, skipFinalSub = false)
```
Same as above, in the non-mixed `sum` implementation.
tests/gpu/t_ec_sum_port.nim (outdated)
```nim
# Test utilities
helpers/prng_unsafe

proc genSumImpl*(asy: Assembler_LLVM, ed: CurveDescriptor): string =
```
Is this a debug file?
For now I've deleted it.
Yes, of sorts. It's the file documenting how I ported over the CPU version. I wanted to commit it so that it wouldn't be lost (in the PR branch at least!). It can be useful both for other people trying to get an idea of how not to get overwhelmed porting a more complex algorithm, and for myself in the future.
tests/gpu/t_mul.nim (outdated)
```nim
let a = Fp[BN254_Snarks].fromUInt(1'u64)
let b = Fp[BN254_Snarks].fromHex("0x2beb0d0d6115007676f30bcc462fe814bf81198848f139621a3e9fa454fe8e6a")

testName(Fp[BN254_Snarks], 64, a, b)
```
This test fails for me. I will have a look:
Using CUDA Device [0]: NVIDIA GeForce RTX 3080 Laptop GPU
Compute Capability: SM 8.6
CPU = 0x2beb0d0d6115007676f30bcc462fe814bf81198848f139621a3e9fa454fe8e6a
GPU = 0x02ea4211d4fe77aa8e6f314bf450574e3d16790b8b20238756c6cf8d1a9ed9e8
fatal.nim(53) sysFatal
Error: unhandled exception: t_mul.nim(35, 5) `bool(rCPU == rGPU)` [AssertionDefect]
I assume it's failing due to an Nvidia bug with 64-bit mul: see #348 (comment) / https://forums.developer.nvidia.com/t/incorrect-result-of-ptx-code/221067
No issue on 32-bit, so same bug
E.g. when calling `addIncoming` for φ nodes, one needs to pass the current function.
`nsqr` exists both as a 'compile time' and a 'runtime' version, where compile time here refers to the JIT compilation of the CUDA code.
And a very basic load / store example (useful to understand how data is passed to GPU depending on type)
That extra special out-of-place `r` argument was a bit useless.
Because we might not always want to load directly, but rather keep the pointer to some arbitrary (nested) array.
This allows for a bit more sanity wrt differentiating between field points and EC points
We will need 'prime plus 1 div 2' later to implement `div2` for finite field points. The code for the logic is directly ported from the `precompute.nim` logic. Ideally we could avoid the duplication of logic, but well.
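Why '(p + 1) div 2' gives `div2`: since 2 * ((p + 1) div 2) = p + 1 ≡ 1 (mod p) for odd p, that constant is the modular inverse of 2, so halving a field element is a multiplication by it. A toy illustration with plain integers (not Constantine's bigints or its actual API):

```nim
# Toy demonstration of halving in F_p via multiplication by (p + 1) div 2.
let p = 101                      # any odd prime
let halfOfOne = (p + 1) div 2    # 51, the modular inverse of 2 mod 101

proc div2(a: int): int = (a * halfOfOne) mod p

# Doubling after halving must give back the original element.
for a in 0 ..< p:
  doAssert (2 * div2(a)) mod p == a
```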
We could consider making all the `_internal` procedures take `EcPoint` etc. types. That way overload resolution would not be a problem and we'd avoid more nasty bugs.
No need for compile time values and this allows us e.g. to use it for curve coefficients stored in the CurveDescriptor object
We cannot use it in all the same places as for the CPU implementation. There is some kind of hidden bug here, related to remaining state. See the notes in the code, but essentially if we run the `tests/gpu/t_ec_sum.nim` test case with `let a, b` being EC points, `res = a`, and then `while true: res = ecSum(res, b)`, after a certain number of iterations the code will fail to match the expected CPU result for the same operation. Maybe related to the additional bits available in the big ints storing some problematic data?
i.e. so that we can write `x.addr` as an argument. We pre- and suffix it with backticks.
This extends the existing `add`, `mul`, `sub` operations on finite fields on the Nvidia target by `ccopy`, `neg`, `cneg`, `nsqr`. For `nsqr` we have two different implementations for now.

I added a simple test case for each operation, which compares with the equivalent operation on the CPU. Currently I just cloned the test file for each case. It'd probably be wise to try to unite them / abstract some of the boilerplate a bit.
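One way the cloned test files could be united is a single template parameterized over the field operation. This is a hypothetical sketch: `runGpuOp` and the template shape are illustrative, not this PR's API; only `Fp[BN254_Snarks]`, the PRNG helpers and the `bool(rCPU == rGPU)` comparison mirror the existing tests.

```nim
import helpers/prng_unsafe

template testFieldOp(opName: string, cpuOp, gpuOp: untyped) =
  ## Compare a CPU field operation against its GPU counterpart on random inputs.
  block:
    var rng: RngState
    rng.seed(42)
    let a = rng.random_unsafe(Fp[BN254_Snarks])
    let b = rng.random_unsafe(Fp[BN254_Snarks])
    var rCPU: Fp[BN254_Snarks]
    cpuOp(rCPU, a, b)                 # reference result on the CPU
    let rGPU = gpuOp(a, b)            # hypothetical GPU dispatch helper
    doAssert bool(rCPU == rGPU), "Mismatch in " & opName
```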
Update 2024/09/19

I have now added the following additional finite field operations:
- `setZero`
- `cadd`
- `csub`
- `isZero`
- `isOdd`
- `shiftRight`
- `div2`

In addition, I split all the implementations for each finite field op into an internal and a public part. That way we can reuse the operations in other ops.
Further, I added distinct `Array` types to deal with finite field points and elliptic curve points. With those, I then added a 'test case' of sorts, where I ported the EC short Weierstrass Jacobian `sumImpl` logic line by line. With the help of some templates the implementation is essentially the same as for the CPU now, and the code produces the same result.

Next I'll clean up that test case and add the EC point addition to the `pub_curves.nim` module.

Update 2024/09/23

I've now added the EC addition to a new `pub_curves.nim` module. In addition, I added a helper type `NvidiaAssembler` to simplify the Nvidia setup & compilation process. It can now be done with 2 lines. (Alternatively, one can use the `Assembler_LLVM` part of the `NvidiaAssembler` object to build instructions / call a function and then just pass the name of the kernel to `compile`.) Using this, the boilerplate in the tests is reduced massively.

Note: The `t_ec_sum_port.nim` file is the one I used to write the port. I think it can be useful to give ideas on how to port some of our existing CPU logic to the LLVM target.

Next: I'll add other EC operations.
Update 2024/09/26

Over the last few days I have:
- split the `EcPoint` type into `EcPointAff`, `EcPointJac` and `EcPointProj`, which each have their own set of internal / public procedures
- added `fieldOps`, `ellipticOps`, `ellipticAffOps`, which inject multiple templates for operations to allow writing code equivalent to the CPU code
- added `mixedSum` for Jacobian + Affine EC points

Thanks to the templates, all that was needed to port the code was to replace variable declarations and `=` assignments by `var X = asy.newEcPointJac(ed)` and `store(X, Y)`. Aside from a minor issue due to an `isNeutral` name collision (internal procs are still using `ValueRef` as arguments and hence overload resolution is an issue; I have since renamed the affine version), it worked immediately.