Implement finite field ccopy, neg, cneg, nsqr, ... for CUDA target #466

Merged: 85 commits merged into master from the moreFFnvidia branch on Oct 27, 2024

Conversation

Vindaar (Collaborator) commented Sep 10, 2024:

This extends the existing add, mul, sub operations on finite fields for the Nvidia target with ccopy, neg, cneg, nsqr. For nsqr we have two different implementations for now (a sketch of the intended semantics follows after the list):

  1. generate N squaring operations based on a Nim runtime / JIT compile time value
  2. generate a loop on the GPU using a phi node and branches for JIT runtime values
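A minimal sketch of the intended nsqr semantics, written against a generic field element type FF with a two-argument square (an assumption mirroring the CPU API quoted later in this thread); it illustrates what the kernel computes, not the JIT code path itself:

  proc nsqrRef[FF](r: var FF, a: FF, n: int) =
    ## n repeated squarings; variant 1 unrolls this loop at JIT compile time,
    ## variant 2 emits it as a PTX loop built from a phi node and branches.
    r = a
    for _ in 0 ..< n:
      r.square(r)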

I added a simple test case for each operation, which compares with the equivalent operation on the CPU. Currently I just cloned the test file for each case. It'd probably be wise to unify them / abstract away some of the boilerplate.

Update 2024/09/19

I have now added the following more finite field operations:

  • setZero
  • cadd
  • csub
  • double
  • isZero
  • isOdd
  • shiftRight
  • div2

In addition, I split all the implementations for each finite field op into an internal and a public part. That way we can reuse the internal operations when building other ops.

Further, I added distinct Array types to deal with finite field points and elliptic curve points. With those, I then added a 'test case' of sorts, where I ported the EC short Weierstrass Jacobian sumImpl logic line by line. With the help of some templates, the implementation is now essentially the same as for the CPU, and the code produces the same result.

Next I'll clean up that test case and add the EC point addition to the pub_curves.nim module.

Update 2024/09/23

I've now added the EC addition to a new pub_curves.nim module. In addition, I added a helper type NvidiaAssembler to simplify the Nvidia setup & compilation process. It can now be done with 2 lines:

  let nv = initNvAsm(EC_ShortW_Jac[field, G1], wordSize)  # either takes an EC_ShortW_Jac type or a Fp / Fr type
  let kernel = nv.compile(genEcSum) # pass the proc defining an `llvmPublicFnDef` 

(alternatively one can use the Assembler_LLVM part of the NvidiaAssembler object to build instructions / call a function and then just pass the name of the kernel to compile)

Using this, the boilerplate in the tests is reduced massively.

Note: The t_ec_sum_port.nim file is the one I used to write the port. I think it can be useful to give ideas on how to port some of our existing CPU logic to the LLVM target.

Next: I'll add other EC operations.

Update 2024/09/26

Over the last few days I have:

  • split the EcPoint type into EcPointAff, EcPointJac and EcPointProj, which each have their own set of internal / public procedures
  • added more misc. field and EC operations
  • added templates fieldOps, ellipticOps, ellipticAffOps that inject multiple templates for operations, to allow writing code equivalent to the CPU code
  • ported mixedSum for Jacobian + Affine EC points. Thanks to the templates, all that was needed to port the code was to replace variable declarations and `=` assignments by `var X = asy.newEcPointJac(ed)` and `store(X, Y)` (see the sketch after this list). Aside from a minor issue due to an isNeutral name collision (internal procs still take ValueRef arguments and hence overload resolution is an issue; I have since renamed the affine version), it worked immediately.
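A hypothetical before/after sketch of that porting pattern; the GPU-side helpers newEcPointJac and store and the ellipticOps templates come from the description above, while the exact bodies and the `sum` call are illustrative rather than a verbatim excerpt from the PR:

  # CPU version
  var t {.noInit.}: EC_ShortW_Jac[field, G1]
  t.sum(P, Q)
  R = t

  # GPU/LLVM port inside a kernel body, with the ellipticOps templates in scope
  var t = asy.newEcPointJac(ed)   # declaration becomes an explicit allocation
  t.sum(P, Q)                     # same call shape, dispatched to the LLVM impl
  store(R, t)                     # `=` assignment becomes an explicit store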

Vindaar changed the title from "Implement finite field ccopy, neg, cneg, nsqr for CUDA target" to "Implement finite field ccopy, neg, cneg, nsqr, ... for CUDA target" on Sep 19, 2024
mratsim (Owner) left a comment:

Thank you. This in general looks good to me, especially the template that allows the GPU impl to not be overly verbose with the assembler context(s).

The _internal tag is a bit annoying though, as it will be everywhere when debugging, optimizing or calling those functions, while the pub functions only matter when calling the library, and even then they are only generated and called in tests through their ctt_ prefix. So I would move the internal functions into impl, maybe a new impl_fields.nim that just does the dispatch/wrapping, and make sure that when reexporting pub_fields.nim users can't access the internal functions.

There could be a macro that avoids the repetition of making an impl function public (a hypothetical sketch follows below).
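A hypothetical sketch of such a macro; every name here, including the FieldDescriptor parameter list, is illustrative rather than the PR's actual API:

  import std/[macros, strutils]

  macro genPublic(internalOp: untyped): untyped =
    ## Given an internal op such as `fpAdd_internal`, emit a thin public
    ## wrapper `fpAdd` that only forwards to it.
    let pubName = ident(($internalOp).replace("_internal", ""))
    result = quote do:
      proc `pubName`*(asy: Assembler_LLVM, fd: FieldDescriptor, r, a, b: ValueRef) =
        `internalOp`(asy, fd, r, a, b)

  genPublic(fpAdd_internal)   # would generate `proc fpAdd*(...)`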

proc determineDevicePtrs(r, i: NimNode, iTypes: seq[NimNode]): seq[(NimNode, NimNode)] =
  ## Returns the device pointer ident and its associated original symbol.
  for el in r:
    #if not el.requiresCopy:
mratsim (Owner):

Is that expected to be commented out?

Vindaar (Collaborator, Author):

I had a case where I triggered the error when I didn't expect to. I wanted to investigate at a later time, forgot, and now I can't reproduce it. Will have another look and otherwise put it back in.

Vindaar (Collaborator, Author):

Remembered the use case where it caused issues. I've now added test cases for execCuda and changed the logic to an allowedCopy check for this. I.e. if one passes a var symbol as a result parameter, it is considered allowed (despite potentially not being a ptr T type). The macro logic now checks whether it needs to apply an addr to the argument or not.
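A usage sketch of the behaviour described above; the exact execCuda parameter names and call shape are assumptions for illustration, not the verified API:

  var rGPU: Fp[BN254_Snarks]        # a plain `var` symbol as the result parameter
  kernel.execCuda(res = rGPU,       # allowed despite not being a `ptr T`;
                  inputs = (a, b))  # the macro applies `addr` where needed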

proc execCudaImpl(jitFn, res, inputs: NimNode): NimNode =
  # XXX: we could check for inputs that are literals and produce local variables so
  # that we can pass them by reference
mratsim (Owner):

literals or staticTy of trivial types.
It's a similar problem to this

proc needTempStorage(argTy: NimNode): bool =
  case argTy.kind
  of nnkVarTy:
    error("It is unsafe to capture a `var` parameter '" & repr(argTy) & "' and pass it to another thread. Its memory location could be invalidated if the spawning proc returns before the worker thread finishes.")
  of nnkStaticTy:
    return false
  of nnkBracketExpr:
    if argTy[0].typeKind == ntyTypeDesc:
      return false
    else:
      return true
  of nnkCharLit..nnkNilLit:
    return false
  else:
    return true
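A minimal sketch of the idea in the XXX comment above: a literal (or trivial static) argument cannot be taken by address directly, so the generated code could hoist it into a local first. The emitted shape is an assumption:

  # what the macro could emit for a literal input such as 5'u64
  var arg0 = 5'u64          # literal hoisted into a local variable
  let p0 = addr arg0        # now addressable, can go into the kernel parameter array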

asy.modadd(fd, ri, ai, bi, M)
asy.br.retVoid()

asy.callFn(name, [r, a, b])
mratsim (Owner):

Ah, I see, there are 3 types of functions:

  • underlying primitives
  • internal
  • public

and for fields, internal is just forwarding to modular arithmetic primitives

In the past I created an enum with the types of functions:

Opcode* = enum
  opFpAdd = "fp_add"
  opFrAdd = "fr_add"
  opFpSub = "fp_sub"
  opFrSub = "fr_sub"
  opFpMul = "fp_mul"
  opFrMul = "fr_mul"
  opFpMulSkipFinalSub = "fp_mul_skip_final_sub"
  opFrMulSkipFinalSub = "fr_mul_skip_final_sub"

and I was planning to associate (opcode, curve) -> primitive, to call the proper underlying implementation (a sketch of such a dispatch follows at the end of this comment).

I have yet to find the best solution, but I think removing the _internal noise would be nice to improve debugging, perf optimization and the dev experience. Having all those _internal is annoying, and unlike _vartime there is no security risk to justify it.

I think they should all be moved to the impl_*.nim file and renamed without _internal.

Public functions won't collide in LLVM due to the ctt_ prefix, and the Nim ones won't collide because the internal functions should not be reexported, which is the case if they are implemented in the pub_*.nim files like now.
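A hypothetical sketch of the (opcode, curve) -> primitive association mentioned above; the table layout and the generator signature are assumptions for illustration:

  import std/tables

  type PrimitiveGen = proc (asy: Assembler_LLVM, fd: FieldDescriptor,
                            r, a, b: ValueRef) {.nimcall.}

  var primitives: Table[(Opcode, string), PrimitiveGen]

  proc register(op: Opcode, curve: string, gen: PrimitiveGen) =
    primitives[(op, curve)] = gen

  proc dispatch(asy: Assembler_LLVM, fd: FieldDescriptor, op: Opcode,
                curve: string, r, a, b: ValueRef) =
    # look up and call the proper underlying modular-arithmetic primitive
    primitives[(op, curve)](asy, fd, r, a, b)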

Comment on lines +279 to +278
## XXX: Similarly to `prod` below, `skipFinalSub = true` here, also breaks the
## code at a different point.
Z2Z2.square(Q.z, skipFinalSub = false)
## XXX: If we set this `skipFinalSub` to true, the code will fail to produce the
## correct result in some cases. Not sure yet why. See `tests/gpu/t_ec_sum.nim`.
S1.prod(Q.z, Z2Z2, skipFinalSub = false)
Vindaar (Collaborator, Author):

Either of these causes problems in the code after multiple iterations in the mentioned test case. The two below (Z1Z1 and S2) work fine though. Maybe we want to avoid it everywhere for the time being, out of an abundance of caution.

Comment on lines +537 to +534
## XXX: either of these `skipFinalSub = true` breaks the code at some point.
## See also `tests/gpu/t_ec_sum.nim`.
Z1Z1.square(P.z, skipFinalSub = false) #not (cd.coef_a == -3))
S2.prod(P.z, Z1Z1, skipFinalSub = false)
Vindaar (Collaborator, Author):

Same as above, in the non-mixed sum implementation.

# Test utilities
helpers/prng_unsafe

proc genSumImpl*(asy: Assembler_LLVM, ed: CurveDescriptor): string =
mratsim (Owner):

Is this a debug file?

mratsim (Owner):

For now I've deleted it.

Vindaar (Collaborator, Author):

Yes, of sorts. It's the file in which I ported over the CPU version. I wanted to commit it so that it wouldn't be lost (in the PR branch at least!). It can be useful both for other people trying to get an idea of how not to get overwhelmed porting a more complex algorithm, or simply for myself in the future.

let a = Fp[BN254_Snarks].fromUInt(1'u64)
let b = Fp[BN254_Snarks].fromHex("0x2beb0d0d6115007676f30bcc462fe814bf81198848f139621a3e9fa454fe8e6a")

testName(Fp[BN254_Snarks], 64, a, b)
mratsim (Owner):

This test fails for me, I will have a look

Using CUDA Device [0]: NVIDIA GeForce RTX 3080 Laptop GPU
Compute Capability: SM 8.6
CPU = 0x2beb0d0d6115007676f30bcc462fe814bf81198848f139621a3e9fa454fe8e6a
GPU = 0x02ea4211d4fe77aa8e6f314bf450574e3d16790b8b20238756c6cf8d1a9ed9e8
fatal.nim(53)            sysFatal
Error: unhandled exception: t_mul.nim(35, 5) `bool(rCPU == rGPU)`  [AssertionDefect]

I assume it's failing due to an Nvidia bug with 64-bit mul: see #348 (comment) / https://forums.developer.nvidia.com/t/incorrect-result-of-ptx-code/221067

mratsim (Owner):

No issue on 32-bit, so same bug

  • E.g. when calling `addIncoming` for φ nodes, one needs to pass the current function.
  • `nsqr` exists both as a 'compile time' and a 'runtime' version, where compile time here refers to the JIT compilation of the CUDA code.
  • And a very basic load / store example (useful to understand how data is passed to the GPU depending on type).
  • That extra special `r` out of place argument was a bit useless.
  • Because we might not always want to load directly, but rather keep the pointer to some arbitrary (nested) array.
  • This allows for a bit more sanity wrt differentiating between field points and EC points.
  • We will need 'prime plus 1 div 2' later to implement `div2` for finite field points. The code for the logic is directly ported from the `precompute.nim` logic. Ideally we could avoid the duplication of logic, but well.

Vindaar and others added 27 commits October 27, 2024 16:31

  • We could consider making all the `_internal` procedures take `EcPoint` etc. types. That way overload resolution would not be a problem and we'd avoid more nasty bugs.
  • No need for compile time values, and this allows us e.g. to use it for curve coefficients stored in the CurveDescriptor object.
  • We cannot use it in all the same places as for the CPU implementation. There is some kind of hidden bug here, related to remaining state. See the notes in the code, but essentially, if we run the `tests/gpu/t_ec_sum.nim` test case, after a certain number of iterations of

      let a, b be EC points
      res = a
      while true:
        res = ec sum(res, b)

    the code will fail to match the expected CPU result for the same operation. Maybe related to the additional bits available for the big ints storing some problematic data?
  • i.e. so that we can write `x.addr` as an argument. We pre- and suffix it with backticks.
mratsim merged commit 7c3d76d into master on Oct 27, 2024
24 checks passed
mratsim deleted the moreFFnvidia branch on October 27, 2024 19:24