The results you get back when running a model can depend on the device, and can even vary across several calls on the same device. It might be a good idea to add a "Reproducibility" section to the documentation in which we discuss these issues.
For example, let us use the model introduced in w2v2-how-to:
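Roughly, following the w2v2-how-to README (the download URL and file names are taken from there and should be double-checked):

```python
import audeer
import audonnx

# Download and extract the model as described in w2v2-how-to
url = 'https://zenodo.org/record/6221127/files/w2v2-L-robust-12.6bc4a7fd-1.1.0.zip'
cache_root = audeer.mkdir('cache')
model_root = audeer.mkdir('model')

archive_path = audeer.download_url(url, cache_root, verbose=True)
audeer.extract_archive(archive_path, model_root)

model = audonnx.load(model_root)
```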
Now, let us execute the model on the CPU:
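A sketch of what this looks like, using a random signal and assuming audonnx's `device` argument and the `'logits'` output name from w2v2-how-to:

```python
import numpy as np

sampling_rate = 16000
signal = np.random.normal(size=sampling_rate).astype(np.float32)

# onnxruntime runs on the CPU by default
model = audonnx.load(model_root, device='cpu')
for _ in range(3):
    print(model(signal, sampling_rate)['logits'])
```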
When using the CPU we always get back the same result when executing it multiple times.
Then let's switch to the GPU:
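For example (assuming onnxruntime-gpu is installed and audonnx accepts a CUDA device string):

```python
# Reload the model on the GPU and run it several times
model = audonnx.load(model_root, device='cuda:0')
outputs = np.stack(
    [model(signal, sampling_rate)['logits'] for _ in range(5)]
)
print(outputs)             # differs after the fifth decimal place
print(outputs.mean(axis=0))
```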
We see that we get different results after the fifth decimal place for each run, and the average result deviates from the CPU-based result by:
This is a known ONNX limitation (microsoft/onnxruntime#9704).
In microsoft/onnxruntime#4611 (comment) they propose selecting a fixed convolution algorithm to improve this behavior; see also https://pytorch.org/docs/stable/notes/randomness.html#cuda-convolution-benchmarking.
With audonnx we can achieve this by fixing the convolution algorithm in the underlying onnxruntime session. It does not really improve the results, though.
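A sketch of the kind of session configuration this refers to, using plain onnxruntime (how audonnx forwards provider options is an assumption here, as is the model path):

```python
import onnxruntime as ort

# Select a fixed cuDNN convolution algorithm instead of benchmarking,
# analogous to torch.backends.cudnn.benchmark = False in PyTorch
providers = [
    ('CUDAExecutionProvider', {'cudnn_conv_algo_search': 'DEFAULT'}),
    'CPUExecutionProvider',
]
session = ort.InferenceSession('model/model.onnx', providers=providers)
```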
It seems that we can only recommend the following when reproducibility is desired:
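Presumably along these lines, i.e. run the model on the CPU and round the outputs (a sketch, not the original snippet):

```python
model = audonnx.load(model_root, device='cpu')
np.round(model(signal, sampling_rate)['logits'], 2)
```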
```
array([[0.68, 0.65, 0.50]], dtype=float32)
```
/cc @audeerington

When the output of the model is a class label and not a float value, I guess there is no way to ensure that results are completely reproducible when running on the GPU, as we cannot limit the precision at the end of the operation, and a database might contain some corner cases where we see a class flip when executing again on the GPU.

We have the same problem for regression values: even if we round to two decimal places, there will always be a few boundary cases for which one model returns e.g. 0.03 and another 0.02.

It seems very unfortunate, but the only way to achieve reproducibility when running a model a second time or on a different machine seems to be to not run it on the GPU.

How do we achieve reproducibility on GPU, then? In short, one should not expect full reproducibility across different hardware when running pipelines on GPU, as matrix multiplications are less deterministic on GPU than on CPU.