
Add a section on reproducibility to the docs #61

Open
hagenw opened this issue Mar 20, 2023 · 3 comments

Labels
documentation Improvements or additions to documentation

hagenw commented Mar 20, 2023

The results you get back when running a model can depend on the device, and can even vary across several calls on the same device. It might be a good idea to add a "Reproducibility" section to the documentation in which we discuss these issues.

For example, let us use the model introduced in w2v2-how-to:

import audeer
import audonnx
import numpy as np

# Download and extract the model
url = 'https://zenodo.org/record/6221127/files/w2v2-L-robust-12.6bc4a7fd-1.1.0.zip'
cache_root = audeer.mkdir('cache')
model_root = audeer.mkdir('model')

archive_path = audeer.download_url(url, cache_root, verbose=True)
audeer.extract_archive(archive_path, model_root)

# Create a reproducible random input signal of 1 s length
np.random.seed(1)
sampling_rate = 16000
signal = np.random.normal(size=sampling_rate).astype(np.float32)

Now, let us execute the model on the CPU:

>>> model = audonnx.load(model_root, device='cpu')
>>> model(signal, sampling_rate)['logits']
array([[0.6832043 , 0.64673305, 0.49750742]], dtype=float32)
>>> model(signal, sampling_rate)['logits']
array([[0.6832043 , 0.64673305, 0.49750742]], dtype=float32)
>>> model(signal, sampling_rate)['logits']
array([[0.6832043 , 0.64673305, 0.49750742]], dtype=float32)

When using the CPU, we always get back the same result,
no matter how often we execute the model.

Then let's switch to the GPU:

>>> model = audonnx.load(model_root, device='cuda:0')
>>> model(signal, sampling_rate)['logits']
array([[0.68319285, 0.64667934, 0.49738473]], dtype=float32)
>>> model(signal, sampling_rate)['logits']
array([[0.68317926, 0.6466613 , 0.4974225 ]], dtype=float32)
>>> model(signal, sampling_rate)['logits']
array([[0.683162  , 0.64668435, 0.4973961 ]], dtype=float32)

We see that the results differ from the fifth decimal place on for each run,
and the average GPU result deviates from the CPU-based result by:

array([[-2.62856483e-05, -5.79953194e-05, -1.06304884e-04]], dtype=float32)
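The deviation above can be recomputed from the outputs logged in this issue; a minimal sketch using the printed (and therefore rounded) float32 values:

```python
import numpy as np

# CPU result and the three GPU runs, as printed above (float32, rounded for display)
cpu = np.array([[0.6832043, 0.64673305, 0.49750742]], dtype=np.float32)
gpu_runs = np.array(
    [
        [0.68319285, 0.64667934, 0.49738473],
        [0.68317926, 0.6466613, 0.4974225],
        [0.683162, 0.64668435, 0.4973961],
    ],
    dtype=np.float32,
)

# Average GPU result minus the CPU result
deviation = gpu_runs.mean(axis=0, keepdims=True) - cpu
print(deviation)
```

Because the printed values are rounded, the result matches the array above only up to a few 1e-8.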

This is a known ONNX limitation (microsoft/onnxruntime#9704).
In microsoft/onnxruntime#4611 (comment) they propose to select a fixed convolution algorithm to improve this behavior, see also https://pytorch.org/docs/stable/notes/randomness.html#cuda-convolution-benchmarking.
With audonnx we can achieve this by

>>> providers = [("CUDAExecutionProvider", {'cudnn_conv_algo_search': 'DEFAULT'})]
>>> model = audonnx.load(model_root, device=providers)
>>> model(signal, sampling_rate)['logits']
array([[0.683191  , 0.64670646, 0.4973919 ]], dtype=float32)
>>> model(signal, sampling_rate)['logits']
array([[0.6830938 , 0.6466217 , 0.49734592]], dtype=float32)
>>> model(signal, sampling_rate)['logits']
array([[0.6831656 , 0.64666504, 0.497427  ]], dtype=float32)

Unfortunately, this does not really improve the reproducibility of the results.

It seems that we can only recommend the following when reproducibility is desired:

  • use the CPU as device
  • round the model output to two decimal places, e.g. array([[0.68, 0.65, 0.50]], dtype=float32)
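For the second point, a minimal sketch using the CPU result from above:

```python
import numpy as np

# CPU result as printed above
logits = np.array([[0.6832043, 0.64673305, 0.49750742]], dtype=np.float32)

# Round to two decimal places, so that GPU deviations
# in the fifth decimal place disappear
rounded = np.round(logits, 2)
print(rounded)
```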

/cc @audeerington

hagenw added the documentation label Mar 20, 2023

hagenw commented Mar 20, 2023

When the output of the model is a class label rather than a float value, I guess there is no way to ensure that results are completely reproducible when running on the GPU, as we cannot limit the precision at the end of the operation. A database might contain some corner cases where we see a class flip when executing again on the GPU.
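As a toy illustration (made-up logits and label names, not from the model above), a deviation in the fifth decimal place is already enough to flip the predicted class when two logits are nearly tied:

```python
import numpy as np

# Hypothetical class logits from two runs of the same model on a GPU;
# they differ only in the fifth decimal place
run_1 = np.array([0.50002, 0.49998], dtype=np.float32)
run_2 = np.array([0.49998, 0.50002], dtype=np.float32)

labels = ['class_a', 'class_b']  # made-up label names
print(labels[np.argmax(run_1)])  # class_a
print(labels[np.argmax(run_2)])  # class_b
```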


hagenw commented Mar 21, 2023

We have the same problem for regression values: even if we round to two decimal places, there will always be a few boundary cases for which one run returns e.g. 0.03 and another one 0.02.
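A minimal sketch of such a boundary case (made-up values): two runs that agree up to ~2e-4 can still round to different values.

```python
import numpy as np

# Hypothetical regression outputs from two runs,
# differing only slightly around the rounding boundary 0.025
run_1 = np.float32(0.0251)
run_2 = np.float32(0.0249)

print(np.round(run_1, 2))  # 0.03
print(np.round(run_2, 2))  # 0.02
```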

This is very unfortunate, but the only way to achieve reproducible results when running a model a second time or on a different machine seems to be to not run it on the GPU.


hagenw commented Mar 21, 2023

From https://huggingface.co/docs/diffusers/using-diffusers/reproducibility

How do we also achieve reproducibility on GPU? In short, one should not expect full reproducibility across different hardware when running pipelines on GPU as matrix multiplications are less deterministic on GPU than on CPU
