Implement finite field `ccopy`, `neg`, `cneg`, `nsqr`, ... for CUDA target (#466)

* minor docstring fixes / improvements
* borrow `isNil` for `BasicBlockRef`
* fix ABI for LLVM `neg` procedures
* add phi and branch related LLVM ABI functions
* inject LLVM `fn` in `llvmFnDef` for caller
  E.g. when calling `addIncoming` for φ nodes, one needs to pass the current function.
* add logic to generate ccopy, neg, cneg, nsqr
  `nsqr` exists both as a 'compile time' and 'runtime' version, where compile time here refers to the JIT compilation of the CUDA code.
* [tests] add test cases for `ccopy`, `neg`, `cneg`, `nsqr`, `nsqrRT`
  And a very basic load / store example (useful to understand how data is passed to GPU depending on type)
* change `ccopy` to not have additional extra argument
  That extra special `r` out of place argument was a bit useless
* split `cneg` into internal/public part, fix logic for `ccopy` change
* [tests] fix ccopy test case for API change
* [tests] remove extra pointers in nsqr test
* add `setZero` internal / public finite field proc
* split finite field `sub` into internal/public procs
* split finite field `mul` into internal/public procs
* split finite field `nsqr` into internal / public procs
* add finite field `isZero` proc
* add basic `CurveDescriptor` type
* add `asArray` overload taking `BuilderRef`
* add more generic `getElementPtr`
  Because we might not always want to load directly, but rather keep the pointer to some arbitrary (nested) array.
* add wrapper `Field`, `EcPoint` that are distinct arrays
  This allows for a bit more sanity wrt differentiating between field points and EC points
* port required code precompute -> impl_field_globas for p+1 div 2
  We will need 'prime plus 1 div 2' later to implement `div2` for finite field points. The code for the logic is directly ported from the `precompute.nim` logic. Ideally we could avoid the duplication of logic, but well.
* add proc to get pointer to the prime plus 1 div 2 value
* add internal `add`, fix up docstrings for internal procs
* add `cadd`, `csub`, `shiftRight`, `div2`, `double`, `isOdd` for Fp
* [tests] add test case for EC point component retrieval
* [tests] add test case for `mul`
  I mainly added to debug a bug I saw
* [tests] add "test" case implementing EC addition
  I started with the CPU `sumImpl` template and line by line added each operation for the GPU. With a bunch of helper templates the code essentially looks identical. I checked every line to see if they match. Hence all the commented out `asy.store()` instructions and different proc signatures.
* rename `t_ec_sum_impl` to `t_ec_sum_port`, add doc comment
  Clearer as to what the "test" does and adds a doc comment to the top explaining how it was used
* [nvidia] add macro `execCuda` to execute CUDA kernel with var args
  Deals with deciding if to allocate or just pass by value. Though in a very simple manner!
* [nvidia] add NvidiaAssembler helper to simplify init & compile
* add elliptic curve implementations for LLVM / Nvidia target
* [tests] use NvidiaAssembler in Nvidia tests
* [tests] add EC sum on Nvidia test using `pub_curves`
* run all nvidia tests for `test_nvidia` nimble task
* add overload of `execCuda` without `inputs` argument
* add `neg` and `cneg` for EC points
* port scalar multiplication with CT integer for finite fields to LLVM
* add `double` for EC points
* prepare for different EcPoint types, move to `pub_curves_jacobian`
  EcPoint now is EcPointJac. We will have separate files for different coordinates, like for the CPU code. The distinct Array types will be defined in their respective files.
* port precompute code for Montgomery 'One' for LLVM
* remove `Assembler_LLVM` argument from `store` for field / EC
* add setOne and csetZero for finite fields
* add file for LLVM EC points in affine coords w/ isNeutral
* add `asEcPointJac` overload taking CurveDescriptor
* add `fromAffine` for LLVM
* fix type passed in `genEcIsNeutral` for affine coords
* add template fieldOps, ellipticOps to simplify operations
  Slightly more type safe version of the templates included in some of the procs previously.
* allocate `limbs` for BigNum in LLVM 'precompute'
* remove leftover ptrBool line
* fix `setOne` template pointing to `setZero`
* fix `setOne` for finite field
* improve `derefBool` logic to only maybe deref & raise if not bool
  Raises at Nim runtime but LLVM code construction time. The exception should never raise under normal conditions, only if the code construction is wrong.
* use `derefBool` in other conditionals
* add `csetOne` for fields
* add `csetOne` for fields
* define destructor for `Assembler_LLVM` if ARC/ORC
  TODO: We still need to differentiate between 2.0 and devel due to `var object` requirement for destroy there iirc
* fix global retrieval possibly yielding nil if already called
* call destroy for Assembler_LLVM field for NvidiaAssembler
* rename `isNeutral` for affine coords to avoid name clash
  We could consider to make all the `_internal` procedures take `EcPoint` etc types. That way overload resolution would not be a problem and we'd avoid more nasty bugs
* add `ellipticAffOps` for affine coordinates
* adjust EcPointJac templates for added affine template
* add `mixedSum` between Jacobian and Affine coordinates
* [tests] add test case for jacobian + affine coords
* [fields] use final reduce only in last iteration, expose finalReduce
* [misc] remove int type TODO & forgotten ptrBool
* make `scalarMul` take Nim RT value for LLVM
  No need for compile time values and this allows us e.g. to use it for curve coefficients stored in the CurveDescriptor object
* assign curve coefficients a, b from `configureCurve`
* [fields] support to skip final subtraction in mul related ops
* partially use `skipFinalSub` in EC sum
  We cannot use it in all the same places as for the CPU implementation. There is some kind of hidden bug here, related to remaining state. See the notes in the code, but essentially if we run the `tests/gpu/t_ec_sum.nim` test case after a certain number of iterations of

      let a, b be EC points
      res = a
      while true:
        res = ec sum(res, b)

  the code will fail to match the expected CPU result for the same operation. Maybe related to the additional bits available for the big ints storing some problematic data?
* support other branches of coef_a in EC (mixed) sum
* move internal LLVM procs to `impl` files
* remove `_internal` suffix from procs, repl by "_impl" as string name
* add `requiresCopy` for `execCuda` macro logic back in
* simplify `neg` by avoiding `slct`, use `select`
* [codegen] allow non ident / sym arguments to `execCuda`
  i.e. so that we can write `x.addr` as an argument. We pre- and suffix it with back ticks
* [codegen] expand requires copy to allow copy for `res` parameters
* [codegen] allow literals, consts, gen locals for them, allow ptr T
* [llvm] wrap (subset of) float types of LLVM
* [tests] add test case for LLVM execCuda with different types of args
* [nimble] add execCuda test case to nimble file
* [docs] remove outdated TODO and update execCuda docstring
* implement `setNeutral` for EC points on LLVM
* [tests] add WIP test case for misc EC functions
* ec-nvidia: rename internal field/templates
* ec-nvidia: switch to 32-bit due to upstream fused-multiply-add-with-carry 64-bit bug

---------

Co-authored-by: Mamy Ratsimbazafy <[email protected]>
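
Note: the field helpers named in the list above (`ccopy`, `neg`, `cneg`, `nsqr`, `cadd`, `isOdd`, `div2`) can be illustrated with a minimal sketch. It is written in Python for brevity and is not the Nim/LLVM code from this commit. Plain integers stand in for limb arrays, the modulus `P = 101` is an arbitrary small odd prime, and every function body is an assumption about the usual meaning of these names. `div2` shows why the 'prime plus 1 div 2' constant is useful: halving an odd element amounts to a right shift followed by a conditional add of (p+1)/2.

```python
# Illustrative sketch only. Python ints replace the limb arrays used on the GPU,
# and P is an arbitrary small odd prime, not a modulus taken from the commit.
P = 101
P_PLUS_1_DIV_2 = (P + 1) // 2


def ccopy(a, b, ctl):
    """Conditional copy: return b if ctl is truthy, else a, without branching on ctl."""
    mask = -int(bool(ctl))  # 0 when ctl is false, -1 (all bits set) when true
    return a ^ ((a ^ b) & mask)


def neg(a):
    """Modular negation; the outer reduction keeps neg(0) == 0."""
    return (P - a) % P


def cneg(a, ctl):
    """Conditional negation built on ccopy."""
    return ccopy(a, neg(a), ctl)


def nsqr(a, count):
    """Square a repeatedly, count times, i.e. compute a^(2^count) mod P."""
    for _ in range(count):
        a = (a * a) % P
    return a


def is_odd(a):
    """Lowest bit of the canonical representative."""
    return a & 1


def cadd(a, b, ctl):
    """Conditional modular addition: a + b if ctl is truthy, else a."""
    return ccopy(a, (a + b) % P, ctl)


def div2(a):
    """Halve a modulo P: shift right, then add (P + 1) / 2 if a was odd."""
    was_odd = is_odd(a)
    return cadd(a >> 1, P_PLUS_1_DIV_2, was_odd)
```

For an odd `a = 2k + 1`, the exact half modulo `P` is `(a + P) / 2 = k + (P + 1) / 2`, which is what the shift plus conditional add computes. The sum stays below `P`, so the reduction inside `cadd` is only a safety net here.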