Implement finite field ccopy, neg, cneg, nsqr, ... for CUDA target (#466)

* minor docstring fixes / improvements

* borrow `isNil` for `BasicBlockRef`

* fix ABI for LLVM `neg` procedures

* add phi and branch related LLVM ABI functions

* inject LLVM `fn` in `llvmFnDef` for caller

E.g. when calling `addIncoming` for φ nodes, one needs to pass the
current function.

* add logic to generate ccopy, neg, cneg, nsqr

`nsqr` exists both as a 'compile time' and 'runtime' version, where
compile time here refers to the JIT compilation of the CUDA code.
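
As a rough reference for what these four operations compute, here is a minimal plain-Python model. It is not the LLVM code generated by this commit: the modulus `P` is an arbitrary toy prime, and the function signatures are assumptions chosen for illustration.

```python
# Plain-Python model of the four field operations (illustration only).
P = 2**31 - 1  # hypothetical toy modulus, not a curve prime from the repository

def ccopy(a: int, b: int, ctl: bool) -> int:
    """Conditional copy: return b if ctl is set, else a (branch-free select)."""
    mask = -int(ctl)              # 0 if ctl is False, all ones (-1) if True
    return a ^ (mask & (a ^ b))

def neg(a: int) -> int:
    """Modular negation; maps 0 to 0 rather than to P."""
    return (P - a) % P

def cneg(a: int, ctl: bool) -> int:
    """Conditional negation: negate a only when ctl is set."""
    return ccopy(a, neg(a), ctl)

def nsqr(a: int, count: int) -> int:
    """Square a `count` times in a row, i.e. compute a^(2^count) mod P."""
    for _ in range(count):
        a = a * a % P
    return a
```

In this model the only difference between a 'compile time' and a 'runtime' `nsqr` would be whether `count` is known when the kernel is JIT compiled or is passed in as a kernel argument.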

* [tests] add test cases for `ccopy`, `neg`, `cneg`, `nsqr`, `nsqrRT`

And a very basic load / store example (useful to understand how data
is passed to GPU depending on type)

* change `ccopy` to not have additional extra argument

That extra out-of-place `r` argument was not really useful.

* split `cneg` into internal/public part, fix logic for `ccopy` change

* [tests] fix ccopy test case for API change

* [tests] remove extra pointers in nsqr test

* add `setZero` internal / public finite field proc

* split finite field `sub` into internal/public procs

* split finite field `mul` into internal/public procs

* split finite field `nsqr` into internal / public procs

* add finite field `isZero` proc

* add basic `CurveDescriptor` type

* add `asArray` overload taking `BuilderRef`

* add more generic `getElementPtr`

Because we might not always want to load directly, but rather keep the
pointer to some arbitrary (nested) array.

* add wrapper `Field`, `EcPoint` that are distinct arrays

This allows for a bit more sanity wrt differentiating between field
elements and EC points

* port required code precompute -> impl_field_globas for p+1 div 2

We will need 'prime plus 1 div 2' later to implement `div2` for finite
field elements.

The code for the logic is directly ported from the `precompute.nim`
logic. Ideally we could avoid the duplication of logic, but well.
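
To illustrate why the constant (p+1)/2 is useful for `div2`, here is a minimal Python sketch. The modulus is a toy value and the branch is only for readability; a constant-time implementation would use a conditional add instead, and none of this is taken from the actual LLVM code.

```python
P = 2**31 - 1                  # hypothetical odd prime modulus
P_PLUS_1_DIV_2 = (P + 1) // 2  # the precomputed 'prime plus 1 div 2' constant

def div2(a: int) -> int:
    """Halve a field element in [0, P) without a modular inversion."""
    half = a >> 1                # floor(a / 2)
    if a & 1:
        # For odd a: a/2 = (a + P)/2 = (a - 1)/2 + (P + 1)/2 (mod P).
        # The sum stays below P, so no extra reduction is needed.
        half += P_PLUS_1_DIV_2
    return half
```

For even inputs a plain right shift is already the answer; for odd inputs adding (p+1)/2 to the shifted value is the same as computing (a + p)/2.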

* add proc to get pointer to the prime plus 1 div 2 value

* add internal `add`, fix up docstrings for internal procs

* add `cadd`, `csub`, `shiftRight`, `div2`, `double`, `isOdd` for Fp

* [tests] add test case for EC point component retrieval

* [tests] add test case for `mul`

I mainly added this to debug a bug I saw

* [tests] add "test" case implementing EC addition

I started with the CPU `sumImpl` template and line by line added each
operation for the GPU. With a bunch of helper templates the code
essentially looks identical.

I checked every line to see if they match. Hence all the commented out
`asy.store()` instructions and different proc signatures.

* rename `t_ec_sum_impl` to `t_ec_sum_port`, add doc comment

Clearer as to what the "test" does and adds a doc comment to the top
explaining how it was used

* [nvidia] add macro `execCuda` to execute CUDA kernel with var args

Decides whether to allocate or just pass by value, though in a
very simple manner!

* [nvidia] add NvidiaAssembler helper to simplify init & compile

* add elliptic curve implementations for LLVM / Nvidia target

* [tests] use NvidiaAssembler in Nvidia tests

* [tests] add EC sum on Nvidia test using `pub_curves`

* run all nvidia tests for `test_nvidia` nimble task

* add overload of `execCuda` without `inputs` argument

* add `neg` and `cneg` for EC points

* port scalar multiplication with CT integer for finite fields to LLVM

* add `double` for EC points

* prepare for different EcPoint types, move to `pub_curves_jacobian`

EcPoint now is EcPointJac.

We will have separate files for different coordinates, like for the
CPU code. The distinct Array types will be defined in their respective files.

* port precompute code for Montgomery 'One' for LLVM
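
For reference, the Montgomery 'One' is the Montgomery form of 1, i.e. R mod p with R = 2^(word size × number of limbs). A minimal Python sketch follows; the limb layout and the prime are assumptions for illustration, not values from the repository.

```python
WORD_BITS = 32    # assumed limb size
NUM_LIMBS = 8     # assumed limb count, giving R = 2^256
P = 2**255 - 19   # example prime, chosen only for illustration

R = 1 << (WORD_BITS * NUM_LIMBS)
MONT_ONE = R % P  # Montgomery representation of 1

def to_mont(a: int) -> int:
    """Convert a canonical field element to Montgomery form (a * R mod P)."""
    return a * R % P

def from_mont(a_mont: int) -> int:
    """Convert back from Montgomery form (a_mont * R^-1 mod P)."""
    return a_mont * pow(R, -1, P) % P
```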

* remove `Assembler_LLVM` argument from `store` for field / EC

* add setOne and csetZero for finite fields

* add file for LLVM EC points in affine coords w/ isNeutral

* add `asEcPointJac` overload taking CurveDescriptor

* add `fromAffine` for LLVM

* fix type passed in `genEcIsNeutral` for affine coords

* add template fieldOps, ellipticOps to simplify operations

Slightly more type safe version of the templates included in some of
the procs previously.

* allocate `limbs` for BigNum in LLVM 'precompute'

* remove leftover ptrBool line

* fix `setOne` template pointing to `setZero`

* fix `setOne` for finite field

* improve `derefBool` logic to only maybe deref & raise if not bool

Raises at Nim runtime, i.e. at LLVM code construction time. The exception
should never be raised under normal conditions, only if the code
construction is wrong.

* use `derefBool` in other conditionals

* add `csetOne` for fields

* define destructor for `Assembler_LLVM` if ARC/ORC

TODO: We still need to differentiate between 2.0 and devel due to `var
object` requirement for destroy there iirc

* fix global retrieval possibly yielding nil if already called

* call destroy for Assembler_LLVM field for NvidiaAssembler

* rename `isNeutral` for affine coords to avoid name clash

We could consider making all the `_internal` procedures take
`EcPoint` etc. types. That way overload resolution would not be a
problem and we'd avoid more nasty bugs

* add `ellipticAffOps` for affine coordinates

* adjust EcPointJac templates for added affine template

* add `mixedSum` between Jacobian and Affine coordinates

* [tests] add test case for jacobian + affine coords

* [fields] use final reduce only in last iteration, expose finalReduce

* [misc] remove int type TODO & forgotten ptrBool

* make `scalarMul` take Nim RT value for LLVM

No need for compile time values and this allows us e.g. to use it for
curve coefficients stored in the CurveDescriptor object
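
One way to multiply a field element by a small integer known only at runtime is double-and-add using modular additions alone. The Python sketch below shows that idea. The commit message does not say which strategy `scalarMul` uses, so treat the algorithm, the names and the toy modulus as assumptions.

```python
P = 2**31 - 1  # hypothetical toy modulus

def add_mod(a: int, b: int) -> int:
    """Modular addition of two elements in [0, P)."""
    s = a + b
    return s - P if s >= P else s

def scalar_mul(a: int, k: int) -> int:
    """Compute k * a mod P for a small non-negative integer k."""
    result = 0
    addend = a
    while k > 0:
        if k & 1:                         # current bit of k is set
            result = add_mod(result, addend)
        addend = add_mod(addend, addend)  # double for the next bit
        k >>= 1
    return result
```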

* assign curve coefficients a, b from `configureCurve`

* [fields] support to skip final subtraction in mul related ops

* partially use `skipFinalSub` in EC sum

We cannot use it in all the same places as for the CPU
implementation. There is some kind of hidden bug here, related to
remaining state.

See the notes in the code, but essentially if we run the
`tests/gpu/t_ec_sum.nim` test case after a certain number of
iterations of

let a, b be EC points
res = a
while true:
  res = ec sum(res, b)

the code will fail to match the expected CPU result for the same
operation.

Maybe related to the additional bits available for the big ints
storing some problematic data?
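
For context on what skipping the final subtraction means, here is a minimal Python model of Montgomery multiplication with an optional final subtraction. It only illustrates the general technique under assumed parameters (a toy prime with R > 4p) and does not reproduce or explain the mismatch described above.

```python
P = 2**61 - 1                   # hypothetical prime modulus
R_BITS = 64                     # two 32-bit limbs; R = 2^64 > 4P leaves spare bits
R = 1 << R_BITS
NEG_P_INV = -pow(P, -1, R) % R  # -P^-1 mod R

def mont_mul(a: int, b: int, skip_final_sub: bool = False) -> int:
    """Montgomery product a*b/R mod P for inputs below 2P.

    With R > 4P the unreduced result stays below 2P, so it can be fed
    into another multiplication without the final subtraction.
    """
    t = a * b
    m = (t * NEG_P_INV) % R
    u = (t + m * P) >> R_BITS   # exact: t + m*P is a multiple of R
    if not skip_final_sub and u >= P:
        u -= P
    return u

def final_reduce(a: int) -> int:
    """Bring a lazily reduced value from [0, 2P) back into [0, P)."""
    return a - P if a >= P else a
```

A chain of multiplications can then skip the subtraction on intermediate results and call `final_reduce` once at the end, as long as every intermediate stays below 2p.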

* support other branches of coef_a in EC (mixed) sum

* move internal LLVM procs to `impl` files

* remove `_internal` suffix from procs, replace by "_impl" as string name

* add `requiresCopy` for `execCuda` macro logic back in

* simplify `neg` by avoiding `slct`, use `select`

* [codegen] allow non ident / sym arguments to `execCuda`

I.e. so that we can write `x.addr` as an argument. We prefix and suffix
it with backticks

* [codegen] expand requires copy to allow copy for `res` parameters

* [codegen] allow literals, consts, gen locals for them, allow ptr T

* [llvm] wrap (subset of) float types of LLVM

* [tests] add test case for LLVM execCuda with different types of args

* [nimble] add execCuda test case to nimble file

* [docs] remove outdated TODO and update execCuda docstring

* implement `setNeutral` for EC points on LLVM

* [tests] add WIP test case for misc EC functions

* ec-nvidia: rename internal field/templates

* ec-nvidia: switch to 32-bit due to upstream fused-multiply-add-with-carry 64-bit bug
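
Switching the word size changes how a big integer is split into limbs. The Python sketch below shows the decomposition for both sizes; the helper names and the example value are hypothetical, and the upstream fused-multiply-add-with-carry bug itself is not modelled.

```python
def to_limbs(x: int, word_bits: int, num_limbs: int) -> list:
    """Split x into num_limbs little-endian limbs of word_bits bits each."""
    mask = (1 << word_bits) - 1
    return [(x >> (word_bits * i)) & mask for i in range(num_limbs)]

def from_limbs(limbs: list, word_bits: int) -> int:
    """Recombine little-endian limbs into a single integer."""
    return sum(limb << (word_bits * i) for i, limb in enumerate(limbs))

# A 255-bit value fits in 4 limbs of 64 bits or 8 limbs of 32 bits.
value = 2**255 - 19
limbs64 = to_limbs(value, 64, 4)
limbs32 = to_limbs(value, 32, 8)
```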

---------

Co-authored-by: Mamy Ratsimbazafy
Vindaar and mratsim authored Oct 27, 2024
1 parent 2c3de87 commit 7c3d76d
Showing 28 changed files with 3,782 additions and 47 deletions.
10 changes: 10 additions & 0 deletions constantine.nimble
@@ -612,7 +612,17 @@ const testDesc: seq[tuple[path: string, useGMP: bool]] = @[
]

const testDescNvidia: seq[string] = @[
  "tests/gpu/t_load_store.nim",
  "tests/gpu/t_exec_literals_consts.nim",
  "tests/gpu/t_nvidia_fp.nim",
  "tests/gpu/t_mul.nim",
  "tests/gpu/t_neg.nim",
  "tests/gpu/t_ccopy.nim",
  "tests/gpu/t_cneg.nim",
  "tests/gpu/t_nsqr.nim",
  "tests/gpu/t_nsqr_rt.nim",
  "tests/gpu/t_ec_jac_coords.nim",
  "tests/gpu/t_ec_sum.nim"
]

const testDescThreadpool: seq[string] = @[