Double Free for Ionization (RNG-using) Simulations #1953
Comments
With the "hack" in #1951 I get the following backtrace:
with |
Please run with blocking kernel and cuda-memtest. |
Could you please also provide the error message? Something like "invalid memory access" or similar. |
The function Thus from the kernel call: PMACC_KERNEL(kernel::InitRNGProvider<RNGMethod>{})
(gridSize, blockSize)
(bufferBox, seed, m_size); with:
Surprisingly with the first two values and the previously called line of code const uint32_t gridSize = (m_size.productOfComponents() + blockSize - 1u) / blockSize; // Round up
|
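The round-up in the quoted `gridSize` line can be checked in isolation. Below is a minimal sketch, not PMacc code: the function names `ceilDivGridSize` and `launchedThreads` are made up for illustration, and only the arithmetic is taken from the line above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of PMacc): number of blocks needed so that
// gridSize * blockSize threads cover numElements.
// Adding (blockSize - 1) before the integer division rounds the result up.
// Assumes blockSize > 0 and that numElements + blockSize - 1 does not overflow.
constexpr std::uint32_t ceilDivGridSize(std::uint32_t numElements, std::uint32_t blockSize)
{
    return (numElements + blockSize - 1u) / blockSize;
}

// Hypothetical helper: total number of threads such a launch would start.
// This can exceed numElements, so a kernel still needs its own bounds check.
constexpr std::uint32_t launchedThreads(std::uint32_t numElements, std::uint32_t blockSize)
{
    return ceilDivGridSize(numElements, blockSize) * blockSize;
}
```

Because the last block is usually only partly filled, a kernel launched this way would typically compare its linear thread index against the element count before touching the buffer.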
When running
When running picongpu with blocking kernel on, the following error message is given:
without blocking kernel the error message differs:
|
Oh sry my fault I mean |
Maybe it is an issue triggered by our new DataConnector or the change within the Environment. We should check it next week. |
result from
|
Have you enabled the compile flag "show codeline"? If not, please enable it, or do not delete the binary you used. We can extract the line number from it. |
Did you change the architecture to sm_35? If not, then this is the error. There is a bug in the cmake file: the PTX code is not embedded, so using the wrong architecture can trigger an error like this. |
setting 35 causes an error:
and is extremely slow. |
However, the But even |
Using |
Since this is an issue with the ionization, I would like to mention @n01r. |
I just did a short test on a k80 node. |
@n01r How did you activate ionization in your simulation? |
edit by @ax3l: and manually setting |
I repeated the test with ADK w/o
|
Please run on one GPU and try to reproduce it. After that, please run with cuda-gdb and print out the backtrace. |
Yeah, I know the drill - already at it. |
@psychocoderHPC Do you remember that I entered your office a couple of weeks ago to tell you the same thing, that after a completed simulation it would crash? You told me it is a known issue with freeing memory in the cleanup step. I did not try with sm_20 so far, though. Only with sm_35. |
@PrometheusPi: I can reproduce the error with the current dev (LWFA plane) and it is fixed in my test in #1960 (see this test |
This pull request mixes two different bugs:
Point 1 is also addressed in #1954 and fixed with #1960. From this point on, we use this issue only to discuss the ADK error. |
The ADK problem looks a bit like a double free or use-after-free of the RNG. I will check the output of the new data connector in verbose mode with it:
|
@ax3l Yes, the issue is triggered by the data connector because of wrong ownership. The plain pointer of
Solution
We need to hold the
Note: I am currently sick and cannot address this issue within the next 2 weeks. |
Yes, I just posted the same 1 minute ago above :D Currently testing... |
Our RNGFactory should be shared with the `DataConnector` within MySimulation and should not share itself in its constructor.
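The ownership rule stated above can be illustrated with a small sketch. It uses hypothetical stand-ins (`RngProvider`, `Registry`, `runSimulationSketch`) and is not the PMacc `DataConnector` API. It assumes the registry stores `std::shared_ptr`s, and it shows the owner creating the shared pointer once and handing a copy to the registry, instead of the object wrapping its own `this` in a second, unrelated `shared_ptr` inside its constructor (which would give two independent owners and two deletes).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for an RNG provider; it counts destructor calls
// so that a double destruction would become visible.
struct RngProvider
{
    explicit RngProvider(int& destroyed) : m_destroyed(destroyed) {}
    ~RngProvider() { ++m_destroyed; }
    int& m_destroyed;
};

// Hypothetical registry that shares ownership of everything stored in it.
class Registry
{
public:
    void share(const std::string& id, std::shared_ptr<RngProvider> data)
    {
        m_data[id] = std::move(data);
    }
    void clean() { m_data.clear(); }
    std::size_t size() const { return m_data.size(); }

private:
    std::map<std::string, std::shared_ptr<RngProvider>> m_data;
};

// The owner (here a function standing in for the simulation class) creates
// the shared_ptr once and passes a copy to the registry. Both handles use
// the same control block, so the object is destroyed exactly once.
int runSimulationSketch()
{
    int destroyed = 0;
    {
        Registry registry;
        auto rng = std::make_shared<RngProvider>(destroyed);
        registry.share("rng", rng);
        assert(rng.use_count() == 2); // owner + registry
        registry.clean();             // registry drops its reference
        assert(destroyed == 0);       // still alive: the owner holds it
    }
    return destroyed; // number of destructor calls after shutdown
}
```

In this sketch the shutdown order no longer matters: whichever handle is released last triggers the single destruction.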
Fix #1953 RNG Shutdown via DataConnector
should be fixed with #1963 |
While investigating a PIConGPU crash on taurus using ADK as ionization method, I stumbled upon a segmentation fault at the end of the default Laser Wakefield example using the `dev` version.
In the default setup with ions and ADK, the simulation finishes but runs into a segmentation fault right at the end. With the kernel blocking option, the segmentation fault happens right at the start.
I am currently investigating the cause of this crash with `cuda-gdb`.
Could any one of you (@ax3l, @psychocoderHPC) verify this bug, please?
(Not that it is just a bad module combination I use on hypnos and taurus.)
Update: (2017-04-07)
It turns out there are two issues:
With `sm_20` instead of `sm_35`, both `cuda_memtest` and `picongpu` cause errors. Switching to `sm_35` solves this issue (this is now moved to issue "sm_20 currenly not working for LWFA example" #1954).
Thus I will rename the topic of this issue to only cover ionization. Please see #1954 for the `sm_20` vs `sm_35` issue.
Update: (2017-04-12)
Modules used:
and own `libSplash` (the current master) at `4aa0c039f98295aa75a490ed4fc4df93ae3c9dac`.