Numerical errors found in convolution backward data kernel using detectron2 data #1252

Open
iq136boy opened this issue Apr 18, 2024 · 3 comments
Assignees
Labels
bug Something isn't working urgency_blocker blocking feature deliverables

Comments

@iq136boy
Contributor

We found numerical errors in the convolution backward data kernel when running tests with detectron2 data. I put the data and the error log file here. The log contains one of the MIOpen driver commands that failed with a numerical error. "conv124_dy.bin" is the output tensor data and "conv124_w.bin" is the weight tensor data.

@bartekxk bartekxk self-assigned this Apr 24, 2024
@junliume junliume added bug Something isn't working urgency_blocker blocking feature deliverables labels May 1, 2024
@bartekxk
Contributor

bartekxk commented May 2, 2024

Hi @iq136boy, judging by the error message:
max err: 0.0001745224, number of errors: 1760, 0.02872243% wrong values
I suspect this is catastrophic cancellation. The CPU can produce more accurate results because its FPU may carry intermediate values at 64/80-bit precision.
We can handle it in two ways:

  1. Add instances that use double as the accumulator data type (this could cause some slowdown).
  2. Add splitK for backward data. This solution would not be universal, though: you never know how to choose split K, because you don't know what values the data contains.
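
To illustrate the trade-off between the two options, here is a minimal pure-Python sketch. It is not CK or MIOpen code: the helper names (`f32`, `dot_f32_acc`, `dot_f64_acc`, `dot_split_k`) are hypothetical, float32 arithmetic is emulated by rounding through `struct`, and the inputs are hand-picked so the rounding loss is easy to see rather than taken from the detectron2 data. The example shows rounding loss in a float32 accumulator; catastrophic cancellation is a related way for the same limited precision to surface.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot_f32_acc(a, b):
    """Dot product with a float32 accumulator (rounded after every step)."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = f32(acc + f32(x * y))
    return acc

def dot_f64_acc(a, b):
    """Option 1: accumulate in double, round to float32 once at the end."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return f32(acc)

def dot_split_k(a, b, splits):
    """Option 2: split the reduction into chunks, then add the partial sums."""
    n = len(a)
    chunk = (n + splits - 1) // splits
    partials = [dot_f32_acc(a[i:i + chunk], b[i:i + chunk])
                for i in range(0, n, chunk)]
    acc = 0.0
    for p in partials:
        acc = f32(acc + p)
    return acc

ones = [1.0] * 8
# 2**24 = 16777216 is where float32 stops representing every integer.
good_layout = [16777216.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
bad_layout = [16777216.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]

print(dot_f32_acc(good_layout, ones))     # 16777216.0 (the four 1.0s are lost)
print(dot_f64_acc(good_layout, ones))     # 16777220.0 (exact)
print(dot_split_k(good_layout, ones, 2))  # 16777220.0 (the split happens to help)
print(dot_split_k(bad_layout, ones, 2))   # 16777216.0 (same split, different data)
```

In this toy case the double accumulator gives the exact result for both layouts, while the same split count helps for one arrangement of the values and not for the other. That is the sense in which a fixed split K is hard to choose without knowing the data.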

@iq136boy
Contributor Author

iq136boy commented May 2, 2024

@bartekxk Thanks for the update.
For 1, how much slowdown could it cause?
For 2, is there a default splitK value that can handle most cases, so that the user does not need to choose the value every time?

@jefyang1

I found that our ckProfiler also had mismatches between GPU and CPU when using the command from this issue. I investigated the ckProfiler mismatch and concluded that it could be related to floating-point rounding for large tensors. When the tensors were initialized with integer values, the results matched. It also passed when the tensor size was reduced.
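
One plausible reading of the integer-initialization result is sketched below in pure Python. It is not ckProfiler code: the helpers `f32` and `sum_f32` are hypothetical, and float32 is emulated by rounding through `struct`. Sums of small integers stay exact in float32 (as long as every intermediate value is below 2**24), so the order of accumulation does not matter. With fractional values the result can depend on the order, and a GPU kernel and a CPU reference generally do not add terms in the same order.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_f32(values):
    """Sequential sum with a float32 accumulator."""
    acc = 0.0
    for v in values:
        acc = f32(acc + v)
    return acc

# Small integers: every partial sum is exactly representable in float32.
ints = [float(i) for i in range(1, 1001)]
print(sum_f32(ints) == sum_f32(reversed(ints)))    # True

# Fractional values: 2**-24 is half the float32 spacing just above 1.0.
fracs = [1.0, 2.0 ** -24, 2.0 ** -24]
print(sum_f32(fracs) == sum_f32(reversed(fracs)))  # False
```

The longer the reduction, the more such order-dependent rounding steps there are, which would be consistent with the mismatch disappearing when the tensor size is reduced.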


5 participants