
[QUESTION/DOCUMENTATION] Document about reproducibility of computations #545

Open
necessarily-equal opened this issue Nov 1, 2024 · 3 comments

Comments

@necessarily-equal

Hi! I have a question: I'm considering using accelerate for scientific computation that we wish to be reproducible.

The goal is to be able to run the same computation again on a different setup and get the same result, e.g. with CUDA on a GPU or on an x86 CPU.

I found in https://github.com/AccelerateHS/accelerate/blob/master/accelerate.cabal that it is possible to eliminate a large source of platform- and optimisation-dependent behaviour with -fno-fast-math. That's great. I also saw a -fno-fast-permute-const flag; I'm not sure what it does, but I'm curious about it.

Would you say disabling fast-math is enough to have IEEE754 floating point behavior with deterministic order of operations, so that the calculation result can be the same no matter the platform?

(I can make a PR afterwards to try to summarise this discussion in the README or wherever applicable.)

So:

  1. Can I get reproducible floating point results?
  2. (additionally) What's the effect of -fno-fast-permute-const?

Thanks!

@tomsmeding
Member

The effect of -fno-fast-permute-const is to disable an assumption that accelerate makes about code using permute const; see the documentation of permute. Normally const would not be an allowable combination function for permute (as it is non-commutative), so we give it special behaviour: permute does not actually evaluate the const operation, simply assumes that the index mapping function is injective, and writes the permuted values directly to memory. -fno-fast-permute-const disables this behaviour; the results are then still nondeterministic, but the implementation uses synchronisation so that at least multiple (compound) values do not each get half-written to the destination.
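To illustrate the intuition with a hypothetical sequential model (the `scatterConst` name and list-of-writes representation are made up for illustration, not Accelerate's actual implementation): with const as the combination function and an injective index mapping, every target slot is written at most once, so write order cannot matter and the read-modify-write can be skipped.

```haskell
import qualified Data.IntMap.Strict as IM

-- Hypothetical sequential model of a scatter, standing in for permute.
-- With an injective index mapping no two writes collide, so a plain
-- overwrite (i.e. a `const` combination) is order-independent.
scatterConst :: [(Int, Double)] -> IM.IntMap Double
scatterConst = foldl (\m (i, v) -> IM.insert i v m) IM.empty

injectiveWrites :: [(Int, Double)]
injectiveWrites = [(2, 0.5), (0, 1.5), (1, 2.5)]  -- all indices distinct
```

Since no index repeats, `scatterConst injectiveWrites` equals `scatterConst (reverse injectiveWrites)`; that order-independence is exactly what the fast-permute-const assumption buys.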

Regarding reproducible floating-point results: -fno-fast-math indeed improves things, but parallel algorithms, especially those on GPU, are sometimes non-deterministic by design, and their precise evaluation order depends on device hardware parameters to optimise their performance. In particular, folds and scans will select block sizes that depend on GPU hardware, and these block sizes will probably differ from the ones chosen on a CPU. Hence evaluation order may differ there.
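A plain-GHC illustration (no Accelerate involved) of why evaluation order matters: floating-point addition is not associative, so a sequential fold and a tree-shaped parallel reduction over the same values can round differently.

```haskell
-- The classic non-associativity example: both expressions sum the same
-- three values but associate differently, as a sequential fold versus a
-- tree-shaped reduction might.
leftAssoc, rightAssoc :: Double
leftAssoc  = (0.1 + 0.2) + 0.3  -- 0.6000000000000001
rightAssoc = 0.1 + (0.2 + 0.3)  -- 0.6
```

A GPU backend choosing a different reduction tree than a CPU backend produces exactly this kind of discrepancy, even with -fno-fast-math.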

Additionally, the computation model for permute is inherently non-deterministic if the index mapping function is not injective; making it deterministic would severely limit permute's performance.
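The same effect in a hypothetical sequential model of permute (again, `scatterWith` is an illustrative stand-in, not Accelerate's implementation): when the index mapping is not injective, colliding writes are combined in whatever order they arrive, and for floating-point combination functions that order changes the result.

```haskell
import qualified Data.IntMap.Strict as IM

-- Combine colliding writes with `f` in arrival order.
scatterWith :: (Double -> Double -> Double)
            -> [(Int, Double)] -> IM.IntMap Double
scatterWith f = foldl (\m (i, v) -> IM.insertWith f i v m) IM.empty

collidingWrites :: [(Int, Double)]
collidingWrites = [(0, 0.1), (0, 0.2), (0, 0.3)]  -- all hit index 0
```

With (+) as the combination function, processing the writes forwards leaves 0.6000000000000001 at index 0, while processing them in reverse leaves 0.6 — two runs of the same permute that disagree in the last bit.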

Wanting precise IEEE754 floating-point behaviour with deterministic evaluation order is largely incompatible with efficient parallel evaluation. I expect that you won't find other parallel array languages that 1. ensure deterministic evaluation order and 2. are actually efficient.

@necessarily-equal
Author

necessarily-equal commented Nov 1, 2024 via email

@tomsmeding
Member

I am wondering about your specific use-case, and whether Accelerate is the right tool here. :)

Perfect numerical reproducibility across platforms (and even across releases) is simply not a design goal of Accelerate. You may be able to find a sublanguage of Accelerate that is perfectly reproducible, but depending on exactly what that sublanguage is, we may not be able to give guarantees about future releases or backends.

One sublanguage that is safe from "imperfections" is to avoid floats entirely and work with fixed-point numbers -- would that work for you? That gives you a slowdown, clearly (especially on GPUs which have few integer compute units), but you don't seem terribly concerned about slowdowns.
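A minimal sketch of that idea in plain Haskell (the Fx type, scale, and helper names are made up for illustration): fixed-point values as scaled Int64s, where every operation is exact integer arithmetic and hence bit-for-bit reproducible, up to overflow and the chosen rounding in division.

```haskell
import Data.Int (Int64)

-- Fixed-point numbers: Int64 counts of 10^-6 units. All arithmetic is
-- integer arithmetic, so results are identical on every backend.
newtype Fx = Fx Int64 deriving (Eq, Show)

scale :: Int64
scale = 1000000  -- 6 fractional decimal digits

toFx :: Double -> Fx
toFx x = Fx (round (x * fromIntegral scale))

fxAdd, fxMul :: Fx -> Fx -> Fx
fxAdd (Fx a) (Fx b) = Fx (a + b)
fxMul (Fx a) (Fx b) = Fx ((a * b) `div` scale)  -- rounds toward -inf
```

Note that even 0.1 + 0.2 == 0.3 holds in this representation (100000 + 200000 == 300000), which it famously does not for Double.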

The main surprising thing about fusion is that it may duplicate computation; this happens when a "producer" (e.g. generate, map) is fused into a backpermute or a stencil operation. (Possibly more.) I don't think duplication of computation really makes things non-reproducible, though; I think (!) you don't need to worry about fusion in your case.

Another point: differing hardware makes differing guarantees about numerical correctness. For example, NVIDIA writes that trigonometric functions may not return exactly the same results as CPU functions do due to differing algorithms, and numerical properties may also depend on exactly which instructions a particular computation is lowered to (e.g. whether an FMA is used).

I do think the most robust way to get what you describe is to avoid floating point numbers entirely, if possible. Any approach using floats will require in-depth checking of the entire stack from high-level language to hardware to see if any of the components may introduce incompatibilities. If you're okay with doing so, then that's on you -- but I wouldn't recommend it.
