
Double Free for Ionization (RNG-using) Simulations #1953

Closed
PrometheusPi opened this issue Apr 6, 2017 · 28 comments
Assignees
Labels
bug (a bug in the project's code), component: core (in PIConGPU, core application)

Comments

@PrometheusPi
Member

PrometheusPi commented Apr 6, 2017

While investigating a PIConGPU crash on taurus using ADK as ionization method, I stumbled upon a segmentation fault at the end of the default Laser Wakefield example using the dev version.
In the default setup with ions and ADK, the simulation finishes but runs into a segmentation fault right at the end. With the kernel blocking option, the segmentation fault happens right at the start.

I am currently investigating the cause of this crash with cuda-gdb.

Could any of you (@ax3l, @psychocoderHPC) please verify this bug?
(To rule out that it is just a bad module combination I use on hypnos and taurus.)

Update: (2017-04-07)
It turns out there are two issues:

  • If using sm_20 instead of sm_35, both cuda_memtest and picongpu cause errors. Switching to sm_35 solves this issue (this is now moved to issue sm_20 currently not working for LWFA example #1954).
  • If using atomic hydrogen, we still see an out-of-memory error even for very small simulation volumes.

Thus I will rename this issue to cover only ionization. Please see #1954 for the sm_20 vs. sm_35 issue.

Update: (2017-04-12)
Modules used:

  1) gcc/4.9.2
  2) cmake/3.3.0  
  3) boost/1.60.0   
  4) cuda/8.0  
  5) openmpi/1.8.6.kepler.cuda80   
  6) pngwriter/0.5.6
  7) hdf5-parallel/1.8.15

and my own libSplash build (current master) at 4aa0c039f98295aa75a490ed4fc4df93ae3c9dac.

@PrometheusPi added the bug (a bug in the project's code) and question labels on Apr 6, 2017
@PrometheusPi added this to the Future milestone on Apr 6, 2017
@PrometheusPi self-assigned this on Apr 6, 2017
@PrometheusPi
Member Author

With the "hack" in #1951 I git the following backtrace:

#0  0x00007ffff48a4c37 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff48a8028 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ffff50be9ed in __gnu_cxx::__verbose_terminate_handler() () from /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6
#3  0x00007ffff50bc986 in __cxxabiv1::__terminate(void (*)()) () from /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6
#4  0x00007ffff50bc9d1 in std::terminate() () from /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6
#5  0x00007ffff50bcc18 in __cxa_throw () from /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6
#6  0x00000000009e1b76 in PMacc::exec::KernelStarter<PMacc::exec::Kernel<PMacc::random::kernel::InitRNGProvider<PMacc::random::methods::XorMin> >, unsigned int, unsigned int>::operator()<PMacc::DataBox<PMacc::PitchedBox<PMacc::random::methods::XorMin::StateType, 3u> >, unsigned int, PMacc::DataSpace<3u> > (this=this@entry=0x7fffffffc9c0)
    at ___/src/libPMacc/include/eventSystem/events/kernelEvents.hpp:217
#7  0x00000000009e24f9 in operator()<PMacc::DataBox<PMacc::PitchedBox<PMacc::random::methods::XorMin::StateType, 3u> >, unsigned int, PMacc::DataSpace<3u> > (this=0x7fffffffc990)
    at ___/src/libPMacc/include/eventSystem/events/kernelEvents.hpp:241
#8  PMacc::random::RNGProvider<3u, PMacc::random::methods::XorMin>::init (this=0x7fffe8179520, seed=0)
    at ___/src/libPMacc/include/random/RNGProvider.tpp:76
#9  0x0000000000a1296d in picongpu::MySimulation::init (this=0x217f0f0)
    at ___/src/picongpu/include/simulationControl/MySimulation.hpp:326
#10 0x00000000009b36d8 in PMacc::SimulationHelper<3u>::startSimulation (this=0x217f0f0)
    at ___/src/libPMacc/include/simulationControl/SimulationHelper.hpp:215
#11 0x00000000009b3f04 in picongpu::SimulationStarter<picongpu::InitialiserController, picongpu::PluginController, picongpu::MySimulation>::start (this=this@entry=0x7fffffffd1c0)
    at ___/src/picongpu/include/simulationControl/SimulationStarter.hpp:81
#12 0x00000000008b81d9 in main (argc=11, argv=0x7fffffffd2f8)
    at ___/src/picongpu/main.cu:56

with ___ being the path to the source

@psychocoderHPC
Member

Please run with blocking kernel and cuda-memtest.

@psychocoderHPC
Member

Could you please also provide the error message? Something like "invalid memory access" or similar.

@PrometheusPi
Member Author

The function void RNGProvider<T_dim, T_RNGMethod>::init(uint32_t seed) is called several times (though, according to gdb, not all calls start a kernel).
The value of gridSize is optimized out until the crash, but right before the crash it is set to 16384.

Thus from the kernel call:

PMACC_KERNEL(kernel::InitRNGProvider<RNGMethod>{})
    (gridSize, blockSize)
    (bufferBox, seed, m_size);

with:

  • gridSize = 16384
  • blockSize = 256
  • bufferBox = {<PMacc::private_Box::Box<3u, PMacc::PitchedBox<PMacc::random::methods::XorMin::StateType, 3u> >> = {<PMacc::PitchedBox<PMacc::random::methods::XorMin::StateType, 3u>> = {pitch = 3072, pitch2D = 786432, fixedPointer = 0x131f980000}, <No data fields>}, <No data fields>}
  • seed = 0
  • m_size = {<PMacc::math::Vector<int, 3, PMacc::math::StandardAccessor, PMacc::math::StandardNavigator, PMacc::math::detail::Vector_components>> = {<PMacc::math::detail::Vector_components<int, 3>> = {static isConst = <optimized out>, static dim = 3, v = {128, 256, 128}}, <PMacc::math::StandardAccessor> = {<No data fields>}, <PMacc::math::StandardNavigator> = {<No data fields>}, static dim = 3}, static Dim = <optimized out>}

Surprisingly, given the first two values and the previously executed line of code

const uint32_t gridSize = (m_size.productOfComponents() + blockSize - 1u) / blockSize; // Round up

gdb reports m_size.productOfComponents() = 4194049, a number with the prime factors 53 and 79133, which looks odd for the typical data structures used in PIConGPU. Moreover, it cannot sensibly be the product of more than two components (and m_size is 3-dimensional).
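
For comparison, here is a minimal standalone sketch (plain C++, not PIConGPU code) of what the launch-size arithmetic should yield for the 128 x 256 x 128 domain reported above. It suggests the 4194049 shown by gdb is an artifact of optimized-out values rather than the number actually used, since the observed gridSize = 16384 matches 4194304 / 256 exactly:

#include <cstdint>
#include <iostream>

int main()
{
    // domain size reported by gdb for m_size
    const uint32_t size[3] = {128u, 256u, 128u};
    const uint32_t blockSize = 256u;

    // product of components: 128 * 256 * 128 = 4194304 (not 4194049)
    const uint32_t product = size[0] * size[1] * size[2];

    // round up to whole blocks, as in the gridSize line above
    const uint32_t gridSize = (product + blockSize - 1u) / blockSize;

    std::cout << "product  = " << product  << "\n";  // prints 4194304
    std::cout << "gridSize = " << gridSize << "\n";  // prints 16384
    return 0;
}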

@PrometheusPi
Member Author

@psychocoderHPC

When running cuda_memtest, there is an error (on various K20/hypnos nodes):

mpiexec --prefix $MPIHOME -tag-output --display-map -x LIBRARY_PATH -x LD_LIBRARY_PATH -am  .../002_test_1GPU/tbg/openib.conf --mca mpi_leave_pinned 0 -npernode 1 -n 1 .../002_test_1GPU/picongpu/bin/cuda_memtest.sh
 Data for JOB [59174,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: kepler004	Num slots: 4	Max slots: 0	Num procs: 1
 	Process OMPI jobid: [59174,1] App: 0 Process rank: 0

 =============================================================
[1,0]<stderr>:[04/06/2017 21:55:21][kepler004][0]:ERROR: CUDA error: invalid device function, line 312, file ___/thirdParty/cuda_memtest/tests.cu
[1,0]<stderr>:cuda_memtest crash: see file ___/002_test_1GPU/simOutput/cuda_memtest_kepler004_0.err
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[59174,1],0]
  Exit code:    1
--------------------------------------------------------------------------

When running picongpu with the blocking kernel enabled, the following error message is given:

mpiexec --prefix $MPIHOME -x LIBRARY_PATH -x LD_LIBRARY_PATH -tag-output --display-map -am  .../002_test_1GPU/tbg/openib.conf --mca mpi_leave_pinned 0 -npernode 1 -n 1 .../002_test_1GPU/picongpu/bin/picongpu  -d 1 1 1                         -g 128 256 128                        -s 10
 Data for JOB [58224,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: kepler004	Num slots: 4	Max slots: 0	Num procs: 1
 	Process OMPI jobid: [58224,1] App: 0 Process rank: 0

 =============================================================
[1,0]<stderr>:vsetenv LD_LIBRARY_PATH failed
[1,0]<stderr>:vsetenv LD_LIBRARY_PATH failed
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | Sliding Window is OFF
[1,0]<stderr>:[CUDA] Error: < ___/src/libPMacc/include/eventSystem/events/kernelEvents.hpp>:220 Last error after kernel launch N5PMacc6random6kernel15InitRNGProviderINS0_7methods6XorMinEEE [ ___/src/libPMacc/include/random/RNGProvider.tpp:76 ]
[1,0]<stderr>:terminate called after throwing an instance of 'std::runtime_error'
[1,0]<stderr>:  what():  [CUDA] Error: invalid device function
[1,0]<stderr>:[kepler004:25014] *** Process received signal ***
[1,0]<stderr>:[kepler004:25014] Signal: Aborted (6)
[1,0]<stderr>:[kepler004:25014] Signal code:  (-6)
[1,0]<stderr>:[kepler004:25014] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7ff4adbfc330]
[1,0]<stderr>:[kepler004:25014] [ 1] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7ff4aa8d4c37]
[1,0]<stderr>:[kepler004:25014] [ 2] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7ff4aa8d8028]
[1,0]<stderr>:[kepler004:25014] [ 3] [1,0]<stderr>:/opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d)[0x7ff4ab0ee9ed]
[1,0]<stderr>:[kepler004:25014] [ 4] /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(+0x5d986)[0x7ff4ab0ec986]
[1,0]<stderr>:[kepler004:25014] [ 5] /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(+0x5d9d1)[0x7ff4ab0ec9d1]
[1,0]<stderr>:[kepler004:25014] [ 6] [1,0]<stderr>:/opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(+0x5dc18)[0x7ff4ab0ecc18]
[1,0]<stderr>:[kepler004:25014] [ 7] [1,0]<stderr>: .../002_test_1GPU/picongpu/bin/picongpu(_ZNK5PMacc4exec13KernelStarterINS0_6KernelINS_6random6kernel15InitRNGProviderINS3_7methods6XorMinEEEEEjjEclIINS_7DataBoxINS_10PitchedBoxINS7_9StateTypeELj3EEEEEjNS_9DataSpaceILj3EEEEEEvDpRKT_+0x706)[0x9e1b76]
[1,0]<stderr>:[kepler004:25014] [ 8] [1,0]<stderr>: .../002_test_1GPU/picongpu/bin/picongpu(_ZN5PMacc6random11RNGProviderILj3ENS0_7methods6XorMinEE4initEj+0x109)[0x9e24f9]
[1,0]<stderr>:[kepler004:25014] [ 9] [1,0]<stderr>: .../002_test_1GPU/picongpu/bin/picongpu(_ZN8picongpu12MySimulation4initEv+0x5ad)[0xa1296d]
[1,0]<stderr>:[kepler004:25014] [10] [1,0]<stderr>: .../002_test_1GPU/picongpu/bin/picongpu(_ZN5PMacc16SimulationHelperILj3EE15startSimulationEv+0x18)[0x9b36d8]
[1,0]<stderr>:[kepler004:25014] [11] [1,0]<stderr>: .../002_test_1GPU/picongpu/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE5startEv+0xc4)[0x9b3f04]
[1,0]<stderr>:[kepler004:25014] [12] [1,0]<stderr>: .../002_test_1GPU/picongpu/bin/picongpu(main+0x99)[0x8b81d9]
[1,0]<stderr>:[kepler004:25014] [13] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7ff4aa8bff45]
[1,0]<stderr>:[kepler004:25014] [14] [1,0]<stderr>: .../002_test_1GPU/picongpu/bin/picongpu[0x8b856f]
[1,0]<stderr>:[kepler004:25014] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 25014 on node kepler004 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Without the blocking kernel, the error message differs:

mpiexec --prefix $MPIHOME -x LIBRARY_PATH -x LD_LIBRARY_PATH -tag-output --display-map -am .../002_test_1GPU/tbg/openib.conf --mca mpi_leave_pinned 0 -npernode 1 -n 1 .../002_test_1GPU/picongpu/bin/picongpu  -d 1 1 1                         -g 128 256 128                        -s 10
 Data for JOB [63601,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: kepler004	Num slots: 4	Max slots: 0	Num procs: 1
 	Process OMPI jobid: [63601,1] App: 0 Process rank: 0

 =============================================================
[1,0]<stderr>:vsetenv LD_LIBRARY_PATH failed
[1,0]<stderr>:vsetenv LD_LIBRARY_PATH failed
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | Sliding Window is OFF
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | Courant c*dt <= 1.00229 ? 1
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? 0.0247974
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | species i: omega_p * dt <= 0.1 ? 0.000578698
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | y-cells per wavelength: 18.0587
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | macro particles per gpu: 16777216
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 6955.06
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_TIME 1.39e-16
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 4.16712e-08
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_MASS 6.33563e-27
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 1.11432e-15
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 1.22627e+13
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 40903.8
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 5.69418e-10
[1,0]<stdout>:initialization time:  1sec 934msec = 1 sec
[1,0]<stdout>:  0 % =        0 | time elapsed:                    0msec | avg time per step:   0msec
[1,0]<stdout>: 10 % =        1 | time elapsed:                    4msec | avg time per step:   4msec
[1,0]<stdout>: 20 % =        2 | time elapsed:                    6msec | avg time per step:   2msec
[1,0]<stdout>: 30 % =        3 | time elapsed:                    8msec | avg time per step:   2msec
[1,0]<stdout>: 40 % =        4 | time elapsed:                   11msec | avg time per step:   2msec
[1,0]<stdout>: 50 % =        5 | time elapsed:                   13msec | avg time per step:   2msec
[1,0]<stdout>: 60 % =        6 | time elapsed:                   15msec | avg time per step:   2msec
[1,0]<stdout>: 70 % =        7 | time elapsed:                   18msec | avg time per step:   2msec
[1,0]<stdout>: 80 % =        8 | time elapsed:                   20msec | avg time per step:   2msec
[1,0]<stdout>: 90 % =        9 | time elapsed:                   22msec | avg time per step:   2msec
[1,0]<stdout>:100 % =       10 | time elapsed:                   25msec | avg time per step:   2msec
[1,0]<stdout>:calculation  simulation time:  25msec = 0 sec
[1,0]<stderr>:[kepler004:31414] *** Process received signal ***
[1,0]<stderr>:[kepler004:31414] Signal: Segmentation fault (11)
[1,0]<stderr>:[kepler004:31414] Signal code: Address not mapped (1)
[1,0]<stderr>:[kepler004:31414] Failing at address: 0x31
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 31414 on node kepler004 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

@psychocoderHPC
Member

Oh sorry, my fault, I meant cuda-memcheck. This should show the wrong line within the GPU kernel.

@psychocoderHPC
Member

Maybe it is an issue triggered by our new DataConnector or by the change within the Environment. We should check it next week.

@PrometheusPi
Member Author

Result from cuda-memcheck:

cuda-memcheck  .../002_test_1GPU/picongpu/bin/picongpu -d 1 1 1   -g 128 256 128    -s 10
========= CUDA-MEMCHECK
========= Program hit cudaErrorSetOnActiveProcess (error 36) due to "cannot set while device is active in this process" on CUDA API call to cudaSetDeviceFlags. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2eea03]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x6ca610]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc6detail18EnvironmentContext9setDeviceEi + 0x1fd) [0x4d68dd]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc11EnvironmentILj3EE11initDevicesENS_9DataSpaceILj3EEES3_ + 0x9c) [0x5edd8c]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu12MySimulation10pluginLoadEv + 0x1bc) [0x5f501c]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE10pluginLoadEv + 0x2b) [0x4fc4fb]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (main + 0x8c) [0x4b81cc]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21f45]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x4b856f]
=========
========= Program hit cudaErrorSetOnActiveProcess (error 36) due to "cannot set while device is active in this process" on CUDA API call to cudaGetLastError. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2eea03]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x6ba703]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc6detail18EnvironmentContext9setDeviceEi + 0x267) [0x4d6947]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc11EnvironmentILj3EE11initDevicesENS_9DataSpaceILj3EEES3_ + 0x9c) [0x5edd8c]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu12MySimulation10pluginLoadEv + 0x1bc) [0x5f501c]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE10pluginLoadEv + 0x2b) [0x4fc4fb]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (main + 0x8c) [0x4b81cc]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21f45]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x4b856f]
=========
PIConGPUVerbose PHYSICS(1) | Sliding Window is OFF
========= Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaLaunch. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2eea03]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x6bcfde]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc6nvidia16gpuEntryFunctionINS_6random6kernel15InitRNGProviderINS2_7methods6XorMinEEEINS_7DataBoxINS_10PitchedBoxINS6_9StateTypeELj3EEEEEjNS_9DataSpaceILj3EEEEEEvT_DpT0_ + 0x8b) [0x4e7b6b]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZNK5PMacc4exec13KernelStarterINS0_6KernelINS_6random6kernel15InitRNGProviderINS3_7methods6XorMinEEEEEjjEclIINS_7DataBoxINS_10PitchedBoxINS7_9StateTypeELj3EEEEEjNS_9DataSpaceILj3EEEEEEvDpRKT_ + 0x34f) [0x5e17bf]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc6random11RNGProviderILj3ENS0_7methods6XorMinEE4initEj + 0x109) [0x5e24f9]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu12MySimulation4initEv + 0x5ad) [0x61296d]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc16SimulationHelperILj3EE15startSimulationEv + 0x18) [0x5b36d8]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE5startEv + 0xc4) [0x5b3f04]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (main + 0x99) [0x4b81d9]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21f45]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x4b856f]
=========
========= Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaGetLastError. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2eea03]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x6ba703]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZNK5PMacc4exec13KernelStarterINS0_6KernelINS_6random6kernel15InitRNGProviderINS3_7methods6XorMinEEEEEjjEclIINS_7DataBoxINS_10PitchedBoxINS7_9StateTypeELj3EEEEEjNS_9DataSpaceILj3EEEEEEvDpRKT_ + 0x29f) [0x5e170f]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc6random11RNGProviderILj3ENS0_7methods6XorMinEE4initEj + 0x109) [0x5e24f9]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu12MySimulation4initEv + 0x5ad) [0x61296d]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc16SimulationHelperILj3EE15startSimulationEv + 0x18) [0x5b36d8]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE5startEv + 0xc4) [0x5b3f04]
[CUDA] Error: <=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (main + 0x99) [0x4b81d9]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21f45]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x4b856f]
=========
 ___/src/libPMacc/include/eventSystem/events/kernelEvents.hpp>:220 Last error after kernel launch N5PMacc6random6kernel15InitRNGProviderINS0_7methods6XorMinEEE [ ___/src/libPMacc/include/random/RNGProvider.tpp:76 ]
terminate called after throwing an instance of 'std::runtime_error'
  what():  [CUDA] Error: invalid device function
[kepler004:03512] *** Process received signal ***
[kepler004:03512] Signal: Aborted (6)
[kepler004:03512] Signal code:  (-6)
[kepler004:03512] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f31e092c330]
[kepler004:03512] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f31dd604c37]
[kepler004:03512] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f31dd608028]
[kepler004:03512] [ 3] /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d)[0x7f31dde1e9ed]
[kepler004:03512] [ 4] /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(+0x5d986)[0x7f31dde1c986]
[kepler004:03512] [ 5] /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(+0x5d9d1)[0x7f31dde1c9d1]
[kepler004:03512] [ 6] /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(+0x5dc18)[0x7f31dde1cc18]
[kepler004:03512] [ 7]  .../002_test_1GPU/picongpu/bin/picongpu(_ZNK5PMacc4exec13KernelStarterINS0_6KernelINS_6random6kernel15InitRNGProviderINS3_7methods6XorMinEEEEEjjEclIINS_7DataBoxINS_10PitchedBoxINS7_9StateTypeELj3EEEEEjNS_9DataSpaceILj3EEEEEEvDpRKT_+0x706)[0x9e1b76]
[kepler004:03512] [ 8]  .../002_test_1GPU/picongpu/bin/picongpu(_ZN5PMacc6random11RNGProviderILj3ENS0_7methods6XorMinEE4initEj+0x109)[0x9e24f9]
[kepler004:03512] [ 9]  .../002_test_1GPU/picongpu/bin/picongpu(_ZN8picongpu12MySimulation4initEv+0x5ad)[0xa1296d]
[kepler004:03512] [10]  .../002_test_1GPU/picongpu/bin/picongpu(_ZN5PMacc16SimulationHelperILj3EE15startSimulationEv+0x18)[0x9b36d8]
[kepler004:03512] [11]  .../002_test_1GPU/picongpu/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE5startEv+0xc4)[0x9b3f04]
[kepler004:03512] [12]  .../002_test_1GPU/picongpu/bin/picongpu(main+0x99)[0x8b81d9]
[kepler004:03512] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f31dd5eff45]
[kepler004:03512] [14]  .../002_test_1GPU/picongpu/bin/picongpu[0x8b856f]
[kepler004:03512] *** End of error message ***
========= Error: process didn't terminate successfully
========= Internal error (20)
========= No CUDA-MEMCHECK results found

@psychocoderHPC
Member

Have you enabled the compile flag "show codeline"? If not, please enable it or do not delete the binary you used; we can extract the line number from it.

@psychocoderHPC
Member

Did you change the architecture to sm_35? If not, then that is the error. There is a bug in the CMake file where the PTX code is not embedded, and if you use the wrong architecture, an error like this can be triggered.
I will fix the CMake file next week.
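
For illustration only (this is not PIConGPU or PMacc code): "invalid device function" typically means the binary contains neither SASS nor embedded PTX matching the GPU's compute capability. A small CUDA sketch, assuming a standard CUDA toolkit, to print the capability of the device a job actually runs on:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    const cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess)
    {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }
    // A K20 (Kepler) reports 3.5; a binary built only for sm_20 without
    // embedded PTX cannot run its kernels on such a device, and every
    // launch fails with "invalid device function".
    std::printf("compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}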

@PrometheusPi
Member Author

Setting the architecture to 35 causes an error:

  0 % =        0 | time elapsed:                    0msec | avg time per step:   0msec
[CUDA] Error: < ___/src/libPMacc/include/eventSystem/events/kernelEvents.hpp>:220 Last error after kernel launch N8picongpu9particles10ionization21KernelIonizeParticlesE [ ___/src/picongpu/include/particles/ParticlesFunctors.hpp:313 ]
terminate called after throwing an instance of 'std::runtime_error'
  what():  [CUDA] Error: out of memory

and is extremely slow.

@PrometheusPi
Member Author

However, cuda_memtest.sh now runs successfully.

But even .../002_test_1GPU/picongpu/bin/picongpu -d 1 1 1 -g 32 32 32 -s 1 crashes with the above memory error.

@PrometheusPi
Member Author

PrometheusPi commented Apr 6, 2017

Using sm_35 solves the issue when running the examples with electrons only and with electrons + ions (pre-ionized). With ionization, however, both BSI and ADK crash.

@PrometheusPi
Member Author

Since this is an issue with ionization, I would like to mention @n01r.

@PrometheusPi changed the title from "Segmentation fault at end of Laser Wakefield example" to "Out-of-memory in Laser Wakefield example with ionization" on Apr 7, 2017
@ax3l modified the milestones: Next Stable: 0.3.0, Future on Apr 7, 2017
@ax3l added the component: core (in PIConGPU, core application) and component: examples (PIConGPU or PMacc examples) labels and removed the question label on Apr 7, 2017
@n01r
Member

n01r commented Apr 7, 2017

I just did a short test on a K80 node.
I configured a LaserWakefield example from dev with -t 10, which activates ions and ionization. I used BSIHydrogenLike and removed the effectiveAtomicNumbers for this test.
I ran mpiexec -n 8 picongpu -d 2 2 2 -g 64 64 64 -s 2000 and received no errors.

@PrometheusPi
Member Author

@n01r How did you activate ionization in your simulation?
With PARAM_IONIZATION == 1?

@n01r
Member

n01r commented Apr 7, 2017

$PICSRC/configure -t 10 ~/paramSets/088_Issue1953BugOutOfMemoryWithIonization
where the cmake flag set 10 contains
flags[10]="-DCUDA_ARCH=35 -DPARAM_OVERWRITES:LIST=-DPARAM_IONS=1;-DPARAM_IONIZATION=1"

edit by @ax3l: and manually setting BSIEffectiveZ to ADKLinPol: #1960 (comment)

@n01r
Member

n01r commented Apr 7, 2017

I repeated the test with ADK, without PMACC_BLOCKING_KERNEL.
This time I got an error:

 95 % =     1900 | time elapsed:            14sec 637msec | avg time per step:   7msec
100 % =     2000 | time elapsed:            15sec 390msec | avg time per step:   7msec
calculation  simulation time: 15sec 391msec = 15 sec
[kepler020:22371] *** Process received signal ***
[kepler020:22371] Signal: Segmentation fault (11)
[kepler020:22371] Signal code: Address not mapped (1)
[kepler020:22371] Failing at address: 0x30
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 22371 on node kepler020 exited on signal 11 (Segmentation fault).

@psychocoderHPC
Member

psychocoderHPC commented Apr 7, 2017

Please run on one GPU and try to reproduce it. After that, please run with cuda-gdb and print the backtrace.

@n01r
Member

n01r commented Apr 7, 2017

Yeah, I know the drill; I'm already at it.
I only posted the 8-GPU test, but I already did 1-GPU tests for both, and now I'm compiling with the debug flags.

@n01r
Member

n01r commented Apr 7, 2017

@psychocoderHPC Do you remember that I came to your office a couple of weeks ago to tell you the same thing, that it would crash after a completed simulation? You told me it was a known issue with freeing memory in the cleanup step.

I have not tried with sm_20 so far, though, only with sm_35.

@psychocoderHPC
Member

@n01r The cleanup issue you mean was fixed with #1886. It occurred during the time @ax3l refactored the PMacc::DataConnector.

@psychocoderHPC
Member

@PrometheusPi: I can reproduce the error with the current dev (LWFA plane) and it is fixed in my test in #1960 (see this test).

@psychocoderHPC
Member

This issue mixes two different bugs:

  1. Compiling LWFA with sm_20 and running on sm_3x results in [CUDA] Error: invalid device function.
  2. Compiling LWFA+ADK with sm_35 and running on sm_35 results in an invalid memory access.

Point 1 is also addressed in #1954 and fixed with #1960.

From this point on, we will use this issue only to discuss the ADK error.

@ax3l
Member

ax3l commented Apr 12, 2017

The ADK problem looks a bit like a double free or use-after-free of the RNG. I will check the output of the new data connector in verbose mode:

[...]
calculation  simulation time: 102msec = 0 sec
PMaccVerbose MEMORY(1) | DataConnector: unshared 'MallocMCBuffer' (0 uses left)
PMaccVerbose MEMORY(1) | DataConnector: being cleaned (7 datasets left to unshare)
PMaccVerbose MEMORY(1) | DataConnector: unshared 'i' (0 uses left)
PMaccVerbose MEMORY(1) | DataConnector: unshared 'e' (0 uses left)
PMaccVerbose MEMORY(1) | DataConnector: unshared 'RNGProvider3XorMin' (0 uses left)
PMaccVerbose MEMORY(1) | DataConnector: unshared 'FieldTmp0' (0 uses left)
PMaccVerbose MEMORY(1) | DataConnector: unshared 'J' (0 uses left)
PMaccVerbose MEMORY(1) | DataConnector: unshared 'E' (0 uses left)
PMaccVerbose MEMORY(1) | DataConnector: unshared 'B' (0 uses left)
[kepler020:08170] *** Process received signal ***
[kepler020:08170] Signal: Segmentation fault (11)
[kepler020:08170] Signal code: Address not mapped (1)
[kepler020:08170] Failing at address: 0x35
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 8170 on node kepler020 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

@psychocoderHPC
Member

@ax3l Yes, the issue is triggered by the data connector because of wrong ownership.

The plain pointer of the RNGProvider is deleted here in MySimulation, and later the shared pointer is also freed by the DataConnector.

Solution

We need to hold the RNGProvider in MySimulation as a shared pointer and must trigger the share function from MySimulation, not from the class itself.
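
A minimal ownership sketch of the pattern described above, using plain std::shared_ptr; the DataConnector and share() here are simplified stand-ins, not the actual PMacc API (the dataset id is taken from the verbose log above):

#include <memory>
#include <string>
#include <unordered_map>

// Simplified stand-in for PMacc::DataConnector: keeps shared ownership
// of every registered dataset and releases it during cleanup.
struct DataConnector
{
    void share(const std::string& id, std::shared_ptr<void> data)
    {
        datasets[id] = std::move(data);
    }
    std::unordered_map<std::string, std::shared_ptr<void>> datasets;
};

struct RNGProvider { /* device buffers, RNG states, ... */ };

// Buggy pattern: MySimulation deletes a raw RNGProvider pointer while the
// DataConnector also holds a shared_ptr to the same object -> double free.
// Fixed pattern (sketched below): MySimulation holds a shared_ptr and
// registers it with the DataConnector itself, so there is exactly one
// control block and the object is destroyed exactly once.
struct MySimulation
{
    void init(DataConnector& dc)
    {
        rngProvider = std::make_shared<RNGProvider>();
        dc.share("RNGProvider3XorMin", rngProvider); // shared, not re-owned
    }
    std::shared_ptr<RNGProvider> rngProvider; // no manual delete anywhere
};

int main()
{
    DataConnector dc;
    MySimulation sim;
    sim.init(dc);
    // At shutdown both dc and sim drop their references; the RNGProvider
    // is freed once instead of being deleted twice.
    return 0;
}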

Note: I am currently sick and cannot address this issue within the next two weeks.

@ax3l
Member

ax3l commented Apr 12, 2017

Yes, I just posted the same thing a minute ago above :D Currently testing...

@ax3l assigned ax3l and unassigned PrometheusPi on Apr 12, 2017
ax3l added a commit to ax3l/picongpu that referenced this issue Apr 12, 2017
Our RNGFactory should be shared with the `DataConnector` within
MySimulation and should not share itself in its constructor.
@ax3l removed the component: examples (PIConGPU or PMacc examples) label on Apr 12, 2017
@ax3l changed the title from "Out-of-memory in Laser Wakefield example with ionization" to "Double Free for Ionization (RNG-using) Simulations" on Apr 12, 2017
PrometheusPi added a commit that referenced this issue Apr 12, 2017
Fix #1953 RNG Shutdown via DataConnector
@ax3l
Member

ax3l commented Apr 13, 2017

should be fixed with #1963

@ax3l closed this as completed on Apr 13, 2017