
Double Free for Ionization (RNG-using) Simulations #1953

Closed
PrometheusPi opened this issue Apr 6, 2017 · 28 comments
Assignees
Labels
bug (a bug in the project's code), component: core (in PIConGPU, core application)

Comments

@PrometheusPi
Member

PrometheusPi commented Apr 6, 2017

While investigating a PIConGPU crash on taurus using ADK as ionization method, I stumbled upon a segmentation fault at the end of the default Laser Wakefield example using the dev version.
In the default setup with ions and ADK, the simulation finishes but runs into a segmentation fault right at the end. With the kernel blocking option, the segmentation fault happens right at the start.

I am currently investigating the cause of this crash with cuda-gdb.

Could any of you (@ax3l, @psychocoderHPC) please verify this bug?
(To rule out that it is just a bad module combination I use on hypnos and taurus.)

Update: (2017-04-07)
It turns out there are two issues:

  • If using sm_20 instead of sm_35, both cuda_memtest and picongpu cause errors. Switching to sm_35 solves this issue (this is now moved to issue sm_20 currently not working for LWFA example #1954).
  • If using atomic hydrogen, we still see an out-of-memory error even for very small simulation volumes.

Thus I will rename this issue to cover only ionization. Please see #1954 for the sm_20 vs. sm_35 issue.

Update: (2017-04-12)
Modules used:

  1) gcc/4.9.2
  2) cmake/3.3.0  
  3) boost/1.60.0   
  4) cuda/8.0  
  5) openmpi/1.8.6.kepler.cuda80   
  6) pngwriter/0.5.6
  7) hdf5-parallel/1.8.15

and my own libSplash build (current master) at 4aa0c039f98295aa75a490ed4fc4df93ae3c9dac.

@PrometheusPi added the bug (a bug in the project's code) and question labels on Apr 6, 2017
@PrometheusPi added this to the Future milestone on Apr 6, 2017
@PrometheusPi self-assigned this on Apr 6, 2017
@PrometheusPi
Member Author

With the "hack" in #1951 I git the following backtrace:

#0  0x00007ffff48a4c37 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff48a8028 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ffff50be9ed in __gnu_cxx::__verbose_terminate_handler() () from /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6
#3  0x00007ffff50bc986 in __cxxabiv1::__terminate(void (*)()) () from /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6
#4  0x00007ffff50bc9d1 in std::terminate() () from /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6
#5  0x00007ffff50bcc18 in __cxa_throw () from /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6
#6  0x00000000009e1b76 in PMacc::exec::KernelStarter<PMacc::exec::Kernel<PMacc::random::kernel::InitRNGProvider<PMacc::random::methods::XorMin> >, unsigned int, unsigned int>::operator()<PMacc::DataBox<PMacc::PitchedBox<PMacc::random::methods::XorMin::StateType, 3u> >, unsigned int, PMacc::DataSpace<3u> > (this=this@entry=0x7fffffffc9c0)
    at ___/src/libPMacc/include/eventSystem/events/kernelEvents.hpp:217
#7  0x00000000009e24f9 in operator()<PMacc::DataBox<PMacc::PitchedBox<PMacc::random::methods::XorMin::StateType, 3u> >, unsigned int, PMacc::DataSpace<3u> > (this=0x7fffffffc990)
    at ___/src/libPMacc/include/eventSystem/events/kernelEvents.hpp:241
#8  PMacc::random::RNGProvider<3u, PMacc::random::methods::XorMin>::init (this=0x7fffe8179520, seed=0)
    at ___/src/libPMacc/include/random/RNGProvider.tpp:76
#9  0x0000000000a1296d in picongpu::MySimulation::init (this=0x217f0f0)
    at ___/src/picongpu/include/simulationControl/MySimulation.hpp:326
#10 0x00000000009b36d8 in PMacc::SimulationHelper<3u>::startSimulation (this=0x217f0f0)
    at ___/src/libPMacc/include/simulationControl/SimulationHelper.hpp:215
#11 0x00000000009b3f04 in picongpu::SimulationStarter<picongpu::InitialiserController, picongpu::PluginController, picongpu::MySimulation>::start (this=this@entry=0x7fffffffd1c0)
    at ___/src/picongpu/include/simulationControl/SimulationStarter.hpp:81
#12 0x00000000008b81d9 in main (argc=11, argv=0x7fffffffd2f8)
    at ___/src/picongpu/main.cu:56

with ___ being the path to the source

@psychocoderHPC
Member

Please run with blocking kernel and cuda-memtest.

@psychocoderHPC
Member

Could you please also provide the error message? Something like "invalid memory access" or similar.

@PrometheusPi
Member Author

The function void RNGProvider<T_dim, T_RNGMethod>::init(uint32_t seed) is called several times (though, according to gdb, not all calls start a kernel).
The value of gridSize is optimized out until the crash, but right before the crash it is set to 16384.

Thus from the kernel call:

PMACC_KERNEL(kernel::InitRNGProvider<RNGMethod>{})
    (gridSize, blockSize)
    (bufferBox, seed, m_size);

with:

  • gridSize = 16384
  • blockSize = 256
  • bufferBox = {<PMacc::private_Box::Box<3u, PMacc::PitchedBox<PMacc::random::methods::XorMin::StateType, 3u> >> = {<PMacc::PitchedBox<PMacc::random::methods::XorMin::StateType, 3u>> = {pitch = 3072, pitch2D = 786432, fixedPointer = 0x131f980000}, <No data fields>}, <No data fields>}
  • seed = 0
  • m_size = {<PMacc::math::Vector<int, 3, PMacc::math::StandardAccessor, PMacc::math::StandardNavigator, PMacc::math::detail::Vector_components>> = {<PMacc::math::detail::Vector_components<int, 3>> = {static isConst = <optimized out>, static dim = 3, v = {128, 256, 128}}, <PMacc::math::StandardAccessor> = {<No data fields>}, <PMacc::math::StandardNavigator> = {<No data fields>}, static dim = 3}, static Dim = <optimized out>}

Surprisingly, given the first two values and the previously executed line of code

const uint32_t gridSize = (m_size.productOfComponents() + blockSize - 1u) / blockSize; // Round up

gdb reports m_size.productOfComponents() = 4194049, a number with the prime factors 53 and 79133, which looks odd for the typical data structures used in PIConGPU. Moreover, it cannot sensibly be the product of more than two components (and m_size is 3-dimensional).
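
For comparison, here is a minimal standalone sketch (plain C++, not PIConGPU code) of what the launch-size arithmetic should yield for the 128 x 256 x 128 domain reported above. It suggests the 4194049 shown by gdb is an artifact of optimized-out values rather than the number actually used, since the observed gridSize = 16384 matches 4194304 / 256 exactly:

#include <cstdint>
#include <iostream>

int main()
{
    // domain size reported by gdb for m_size
    const uint32_t size[3] = {128u, 256u, 128u};
    const uint32_t blockSize = 256u;

    // product of components: 128 * 256 * 128 = 4194304 (not 4194049)
    const uint32_t product = size[0] * size[1] * size[2];

    // round up to whole blocks, as in the gridSize line above
    const uint32_t gridSize = (product + blockSize - 1u) / blockSize;

    std::cout << "product  = " << product  << "\n";  // prints 4194304
    std::cout << "gridSize = " << gridSize << "\n";  // prints 16384
    return 0;
}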

@PrometheusPi
Member Author

@psychocoderHPC

When running cuda_memtest, there is an error (on various K20/hypnos nodes):

mpiexec --prefix $MPIHOME -tag-output --display-map -x LIBRARY_PATH -x LD_LIBRARY_PATH -am  .../002_test_1GPU/tbg/openib.conf --mca mpi_leave_pinned 0 -npernode 1 -n 1 .../002_test_1GPU/picongpu/bin/cuda_memtest.sh
 Data for JOB [59174,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: kepler004	Num slots: 4	Max slots: 0	Num procs: 1
 	Process OMPI jobid: [59174,1] App: 0 Process rank: 0

 =============================================================
[1,0]<stderr>:[04/06/2017 21:55:21][kepler004][0]:ERROR: CUDA error: invalid device function, line 312, file ___/thirdParty/cuda_memtest/tests.cu
[1,0]<stderr>:cuda_memtest crash: see file ___/002_test_1GPU/simOutput/cuda_memtest_kepler004_0.err
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[59174,1],0]
  Exit code:    1
--------------------------------------------------------------------------

When running picongpu with the blocking kernel enabled, the following error message is given:

mpiexec --prefix $MPIHOME -x LIBRARY_PATH -x LD_LIBRARY_PATH -tag-output --display-map -am  .../002_test_1GPU/tbg/openib.conf --mca mpi_leave_pinned 0 -npernode 1 -n 1 .../002_test_1GPU/picongpu/bin/picongpu  -d 1 1 1                         -g 128 256 128                        -s 10
 Data for JOB [58224,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: kepler004	Num slots: 4	Max slots: 0	Num procs: 1
 	Process OMPI jobid: [58224,1] App: 0 Process rank: 0

 =============================================================
[1,0]<stderr>:vsetenv LD_LIBRARY_PATH failed
[1,0]<stderr>:vsetenv LD_LIBRARY_PATH failed
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | Sliding Window is OFF
[1,0]<stderr>:[CUDA] Error: < ___/src/libPMacc/include/eventSystem/events/kernelEvents.hpp>:220 Last error after kernel launch N5PMacc6random6kernel15InitRNGProviderINS0_7methods6XorMinEEE [ ___/src/libPMacc/include/random/RNGProvider.tpp:76 ]
[1,0]<stderr>:terminate called after throwing an instance of 'std::runtime_error'
[1,0]<stderr>:  what():  [CUDA] Error: invalid device function
[1,0]<stderr>:[kepler004:25014] *** Process received signal ***
[1,0]<stderr>:[kepler004:25014] Signal: Aborted (6)
[1,0]<stderr>:[kepler004:25014] Signal code:  (-6)
[1,0]<stderr>:[kepler004:25014] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7ff4adbfc330]
[1,0]<stderr>:[kepler004:25014] [ 1] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7ff4aa8d4c37]
[1,0]<stderr>:[kepler004:25014] [ 2] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7ff4aa8d8028]
[1,0]<stderr>:[kepler004:25014] [ 3] [1,0]<stderr>:/opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d)[0x7ff4ab0ee9ed]
[1,0]<stderr>:[kepler004:25014] [ 4] /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(+0x5d986)[0x7ff4ab0ec986]
[1,0]<stderr>:[kepler004:25014] [ 5] /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(+0x5d9d1)[0x7ff4ab0ec9d1]
[1,0]<stderr>:[kepler004:25014] [ 6] [1,0]<stderr>:/opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(+0x5dc18)[0x7ff4ab0ecc18]
[1,0]<stderr>:[kepler004:25014] [ 7] [1,0]<stderr>: .../002_test_1GPU/picongpu/bin/picongpu(_ZNK5PMacc4exec13KernelStarterINS0_6KernelINS_6random6kernel15InitRNGProviderINS3_7methods6XorMinEEEEEjjEclIINS_7DataBoxINS_10PitchedBoxINS7_9StateTypeELj3EEEEEjNS_9DataSpaceILj3EEEEEEvDpRKT_+0x706)[0x9e1b76]
[1,0]<stderr>:[kepler004:25014] [ 8] [1,0]<stderr>: .../002_test_1GPU/picongpu/bin/picongpu(_ZN5PMacc6random11RNGProviderILj3ENS0_7methods6XorMinEE4initEj+0x109)[0x9e24f9]
[1,0]<stderr>:[kepler004:25014] [ 9] [1,0]<stderr>: .../002_test_1GPU/picongpu/bin/picongpu(_ZN8picongpu12MySimulation4initEv+0x5ad)[0xa1296d]
[1,0]<stderr>:[kepler004:25014] [10] [1,0]<stderr>: .../002_test_1GPU/picongpu/bin/picongpu(_ZN5PMacc16SimulationHelperILj3EE15startSimulationEv+0x18)[0x9b36d8]
[1,0]<stderr>:[kepler004:25014] [11] [1,0]<stderr>: .../002_test_1GPU/picongpu/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE5startEv+0xc4)[0x9b3f04]
[1,0]<stderr>:[kepler004:25014] [12] [1,0]<stderr>: .../002_test_1GPU/picongpu/bin/picongpu(main+0x99)[0x8b81d9]
[1,0]<stderr>:[kepler004:25014] [13] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7ff4aa8bff45]
[1,0]<stderr>:[kepler004:25014] [14] [1,0]<stderr>: .../002_test_1GPU/picongpu/bin/picongpu[0x8b856f]
[1,0]<stderr>:[kepler004:25014] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 25014 on node kepler004 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Without the blocking kernel, the error message differs:

mpiexec --prefix $MPIHOME -x LIBRARY_PATH -x LD_LIBRARY_PATH -tag-output --display-map -am .../002_test_1GPU/tbg/openib.conf --mca mpi_leave_pinned 0 -npernode 1 -n 1 .../002_test_1GPU/picongpu/bin/picongpu  -d 1 1 1                         -g 128 256 128                        -s 10
 Data for JOB [63601,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: kepler004	Num slots: 4	Max slots: 0	Num procs: 1
 	Process OMPI jobid: [63601,1] App: 0 Process rank: 0

 =============================================================
[1,0]<stderr>:vsetenv LD_LIBRARY_PATH failed
[1,0]<stderr>:vsetenv LD_LIBRARY_PATH failed
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | Sliding Window is OFF
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | Courant c*dt <= 1.00229 ? 1
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? 0.0247974
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | species i: omega_p * dt <= 0.1 ? 0.000578698
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | y-cells per wavelength: 18.0587
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | macro particles per gpu: 16777216
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 6955.06
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_TIME 1.39e-16
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 4.16712e-08
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_MASS 6.33563e-27
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 1.11432e-15
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 1.22627e+13
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 40903.8
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 5.69418e-10
[1,0]<stdout>:initialization time:  1sec 934msec = 1 sec
[1,0]<stdout>:  0 % =        0 | time elapsed:                    0msec | avg time per step:   0msec
[1,0]<stdout>: 10 % =        1 | time elapsed:                    4msec | avg time per step:   4msec
[1,0]<stdout>: 20 % =        2 | time elapsed:                    6msec | avg time per step:   2msec
[1,0]<stdout>: 30 % =        3 | time elapsed:                    8msec | avg time per step:   2msec
[1,0]<stdout>: 40 % =        4 | time elapsed:                   11msec | avg time per step:   2msec
[1,0]<stdout>: 50 % =        5 | time elapsed:                   13msec | avg time per step:   2msec
[1,0]<stdout>: 60 % =        6 | time elapsed:                   15msec | avg time per step:   2msec
[1,0]<stdout>: 70 % =        7 | time elapsed:                   18msec | avg time per step:   2msec
[1,0]<stdout>: 80 % =        8 | time elapsed:                   20msec | avg time per step:   2msec
[1,0]<stdout>: 90 % =        9 | time elapsed:                   22msec | avg time per step:   2msec
[1,0]<stdout>:100 % =       10 | time elapsed:                   25msec | avg time per step:   2msec
[1,0]<stdout>:calculation  simulation time:  25msec = 0 sec
[1,0]<stderr>:[kepler004:31414] *** Process received signal ***
[1,0]<stderr>:[kepler004:31414] Signal: Segmentation fault (11)
[1,0]<stderr>:[kepler004:31414] Signal code: Address not mapped (1)
[1,0]<stderr>:[kepler004:31414] Failing at address: 0x31
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 31414 on node kepler004 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

@psychocoderHPC
Member

Oh sorry, my fault, I meant cuda-memcheck. This should show the wrong line within the GPU kernel.

@psychocoderHPC
Member

Maybe it is an issue triggered by our new DataConnector or by the change within the Environment. We should check it next week.

@PrometheusPi
Member Author

Result from cuda-memcheck:

cuda-memcheck  .../002_test_1GPU/picongpu/bin/picongpu -d 1 1 1   -g 128 256 128    -s 10
========= CUDA-MEMCHECK
========= Program hit cudaErrorSetOnActiveProcess (error 36) due to "cannot set while device is active in this process" on CUDA API call to cudaSetDeviceFlags. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2eea03]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x6ca610]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc6detail18EnvironmentContext9setDeviceEi + 0x1fd) [0x4d68dd]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc11EnvironmentILj3EE11initDevicesENS_9DataSpaceILj3EEES3_ + 0x9c) [0x5edd8c]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu12MySimulation10pluginLoadEv + 0x1bc) [0x5f501c]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE10pluginLoadEv + 0x2b) [0x4fc4fb]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (main + 0x8c) [0x4b81cc]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21f45]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x4b856f]
=========
========= Program hit cudaErrorSetOnActiveProcess (error 36) due to "cannot set while device is active in this process" on CUDA API call to cudaGetLastError. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2eea03]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x6ba703]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc6detail18EnvironmentContext9setDeviceEi + 0x267) [0x4d6947]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc11EnvironmentILj3EE11initDevicesENS_9DataSpaceILj3EEES3_ + 0x9c) [0x5edd8c]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu12MySimulation10pluginLoadEv + 0x1bc) [0x5f501c]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE10pluginLoadEv + 0x2b) [0x4fc4fb]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (main + 0x8c) [0x4b81cc]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21f45]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x4b856f]
=========
PIConGPUVerbose PHYSICS(1) | Sliding Window is OFF
========= Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaLaunch. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2eea03]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x6bcfde]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc6nvidia16gpuEntryFunctionINS_6random6kernel15InitRNGProviderINS2_7methods6XorMinEEEINS_7DataBoxINS_10PitchedBoxINS6_9StateTypeELj3EEEEEjNS_9DataSpaceILj3EEEEEEvT_DpT0_ + 0x8b) [0x4e7b6b]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZNK5PMacc4exec13KernelStarterINS0_6KernelINS_6random6kernel15InitRNGProviderINS3_7methods6XorMinEEEEEjjEclIINS_7DataBoxINS_10PitchedBoxINS7_9StateTypeELj3EEEEEjNS_9DataSpaceILj3EEEEEEvDpRKT_ + 0x34f) [0x5e17bf]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc6random11RNGProviderILj3ENS0_7methods6XorMinEE4initEj + 0x109) [0x5e24f9]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu12MySimulation4initEv + 0x5ad) [0x61296d]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc16SimulationHelperILj3EE15startSimulationEv + 0x18) [0x5b36d8]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE5startEv + 0xc4) [0x5b3f04]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (main + 0x99) [0x4b81d9]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21f45]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x4b856f]
=========
========= Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaGetLastError. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2eea03]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x6ba703]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZNK5PMacc4exec13KernelStarterINS0_6KernelINS_6random6kernel15InitRNGProviderINS3_7methods6XorMinEEEEEjjEclIINS_7DataBoxINS_10PitchedBoxINS7_9StateTypeELj3EEEEEjNS_9DataSpaceILj3EEEEEEvDpRKT_ + 0x29f) [0x5e170f]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc6random11RNGProviderILj3ENS0_7methods6XorMinEE4initEj + 0x109) [0x5e24f9]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu12MySimulation4initEv + 0x5ad) [0x61296d]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc16SimulationHelperILj3EE15startSimulationEv + 0x18) [0x5b36d8]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE5startEv + 0xc4) [0x5b3f04]
[CUDA] Error: <=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (main + 0x99) [0x4b81d9]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21f45]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x4b856f]
=========
 ___/src/libPMacc/include/eventSystem/events/kernelEvents.hpp>:220 Last error after kernel launch N5PMacc6random6kernel15InitRNGProviderINS0_7methods6XorMinEEE [ ___/src/libPMacc/include/random/RNGProvider.tpp:76 ]
terminate called after throwing an instance of 'std::runtime_error'
  what():  [CUDA] Error: invalid device function
[kepler004:03512] *** Process received signal ***
[kepler004:03512] Signal: Aborted (6)
[kepler004:03512] Signal code:  (-6)
[kepler004:03512] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f31e092c330]
[kepler004:03512] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f31dd604c37]
[kepler004:03512] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f31dd608028]
[kepler004:03512] [ 3] /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d)[0x7f31dde1e9ed]
[kepler004:03512] [ 4] /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(+0x5d986)[0x7f31dde1c986]
[kepler004:03512] [ 5] /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(+0x5d9d1)[0x7f31dde1c9d1]
[kepler004:03512] [ 6] /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(+0x5dc18)[0x7f31dde1cc18]
[kepler004:03512] [ 7]  .../002_test_1GPU/picongpu/bin/picongpu(_ZNK5PMacc4exec13KernelStarterINS0_6KernelINS_6random6kernel15InitRNGProviderINS3_7methods6XorMinEEEEEjjEclIINS_7DataBoxINS_10PitchedBoxINS7_9StateTypeELj3EEEEEjNS_9DataSpaceILj3EEEEEEvDpRKT_+0x706)[0x9e1b76]
[kepler004:03512] [ 8]  .../002_test_1GPU/picongpu/bin/picongpu(_ZN5PMacc6random11RNGProviderILj3ENS0_7methods6XorMinEE4initEj+0x109)[0x9e24f9]
[kepler004:03512] [ 9]  .../002_test_1GPU/picongpu/bin/picongpu(_ZN8picongpu12MySimulation4initEv+0x5ad)[0xa1296d]
[kepler004:03512] [10]  .../002_test_1GPU/picongpu/bin/picongpu(_ZN5PMacc16SimulationHelperILj3EE15startSimulationEv+0x18)[0x9b36d8]
[kepler004:03512] [11]  .../002_test_1GPU/picongpu/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE5startEv+0xc4)[0x9b3f04]
[kepler004:03512] [12]  .../002_test_1GPU/picongpu/bin/picongpu(main+0x99)[0x8b81d9]
[kepler004:03512] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f31dd5eff45]
[kepler004:03512] [14]  .../002_test_1GPU/picongpu/bin/picongpu[0x8b856f]
[kepler004:03512] *** End of error message ***
========= Error: process didn't terminate successfully
========= Internal error (20)
========= No CUDA-MEMCHECK results found

@psychocoderHPC
Member

Have you enabled the compile flag "show codeline"? If not, please enable it or do not delete the binary you used; we can extract the line number from it.

@psychocoderHPC
Member

Did you change the architecture to sm_35? If not, then that is the error. There is a bug in the CMake file where the PTX code is not embedded, and if you use the wrong architecture, an error like this can be triggered.
I will fix the CMake file next week.
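
For illustration only (this is not PIConGPU or PMacc code): "invalid device function" typically means the binary contains neither SASS nor embedded PTX matching the GPU's compute capability. A small CUDA sketch, assuming a standard CUDA toolkit, to print the capability of the device a job actually runs on:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    const cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess)
    {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }
    // A K20 (Kepler) reports 3.5; a binary built only for sm_20 without
    // embedded PTX cannot run its kernels on such a device, and every
    // launch fails with "invalid device function".
    std::printf("compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}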

@PrometheusPi
Member Author

Setting the architecture to 35 causes an error:

  0 % =        0 | time elapsed:                    0msec | avg time per step:   0msec
[CUDA] Error: < ___/src/libPMacc/include/eventSystem/events/kernelEvents.hpp>:220 Last error after kernel launch N8picongpu9particles10ionization21KernelIonizeParticlesE [ ___/src/picongpu/include/particles/ParticlesFunctors.hpp:313 ]
terminate called after throwing an instance of 'std::runtime_error'
  what():  [CUDA] Error: out of memory

and is extremely slow.

@PrometheusPi
Member Author

However, cuda_memtest.sh now runs successfully.

But even .../002_test_1GPU/picongpu/bin/picongpu -d 1 1 1 -g 32 32 32 -s 1 crashes with the above memory error.

@PrometheusPi
Member Author

PrometheusPi commented Apr 6, 2017

Using sm_35 solves the issue when running the examples with electrons only and with electrons + ions (pre-ionized). With ionization, however, both BSI and ADK crash.

@PrometheusPi
Member Author

Since this is an issue with ionization, I would like to mention @n01r.

@PrometheusPi changed the title from "Segmentation fault at end of Laser Wakefield example" to "Out-of-memory in Laser Wakefield example with ionization" on Apr 7, 2017
@ax3l modified the milestones: Next Stable: 0.3.0, Future on Apr 7, 2017
@ax3l added the component: core (in PIConGPU, core application) and component: examples (PIConGPU or PMacc examples) labels and removed the question label on Apr 7, 2017
@n01r
Member

n01r commented Apr 7, 2017

I just did a short test on a K80 node.
I configured a LaserWakefield example from dev with -t 10, which activates ions and ionization. I used BSIHydrogenLike and removed the effectiveAtomicNumbers for this test.
I ran mpiexec -n 8 picongpu -d 2 2 2 -g 64 64 64 -s 2000 and received no errors.

@PrometheusPi
Member Author

@n01r How did you activate ionization in your simulation?
With PARAM_IONIZATION == 1?

@n01r
Member

n01r commented Apr 7, 2017

$PICSRC/configure -t 10 ~/paramSets/088_Issue1953BugOutOfMemoryWithIonization
where the cmake flag set 10 contains
flags[10]="-DCUDA_ARCH=35 -DPARAM_OVERWRITES:LIST=-DPARAM_IONS=1;-DPARAM_IONIZATION=1"

edit by @ax3l: and manually setting BSIEffectiveZ to ADKLinPol: #1960 (comment)

@n01r
Member

n01r commented Apr 7, 2017

I repeated the test with ADK, without PMACC_BLOCKING_KERNEL.
This time I got an error:

 95 % =     1900 | time elapsed:            14sec 637msec | avg time per step:   7msec
100 % =     2000 | time elapsed:            15sec 390msec | avg time per step:   7msec
calculation  simulation time: 15sec 391msec = 15 sec
[kepler020:22371] *** Process received signal ***
[kepler020:22371] Signal: Segmentation fault (11)
[kepler020:22371] Signal code: Address not mapped (1)
[kepler020:22371] Failing at address: 0x30
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 22371 on node kepler020 exited on signal 11 (Segmentation fault).

@psychocoderHPC
Member

psychocoderHPC commented Apr 7, 2017

Please run on one GPU and try to reproduce it. After that, please run with cuda-gdb and print the backtrace.

@n01r
Member

n01r commented Apr 7, 2017

Yeah, I know the drill; I'm already at it.
I only posted the 8-GPU test, but I already did 1-GPU tests for both, and now I'm compiling with the debug flags.

@n01r
Member

n01r commented Apr 7, 2017

@psychocoderHPC Do you remember that I came to your office a couple of weeks ago to tell you the same thing, that it would crash after a completed simulation? You told me it was a known issue with freeing memory in the cleanup step.

I have not tried with sm_20 so far, though, only with sm_35.

@psychocoderHPC
Member

@n01r The cleanup issue you mean was fixed with #1886. It occurred during the time @ax3l refactored the PMacc::DataConnector.

@psychocoderHPC
Member

@PrometheusPi: I can reproduce the error with the current dev (LWFA plane) and it is fixed in my test in #1960 (see this test).

@psychocoderHPC
Member

This issue mixes two different bugs:

  1. Compiling LWFA with sm_20 and running on sm_3x results in [CUDA] Error: invalid device function.
  2. Compiling LWFA+ADK with sm_35 and running on sm_35 results in an invalid memory access.

Point 1 is also addressed in #1954 and fixed with #1960.

From this point on, we will use this issue only to discuss the ADK error.

@ax3l
Member

ax3l commented Apr 12, 2017

The ADK problem looks a bit like a double free or use-after-free of the RNG. I will check the output of the new data connector in verbose mode:

[...]
calculation  simulation time: 102msec = 0 sec
PMaccVerbose MEMORY(1) | DataConnector: unshared 'MallocMCBuffer' (0 uses left)
PMaccVerbose MEMORY(1) | DataConnector: being cleaned (7 datasets left to unshare)
PMaccVerbose MEMORY(1) | DataConnector: unshared 'i' (0 uses left)
PMaccVerbose MEMORY(1) | DataConnector: unshared 'e' (0 uses left)
PMaccVerbose MEMORY(1) | DataConnector: unshared 'RNGProvider3XorMin' (0 uses left)
PMaccVerbose MEMORY(1) | DataConnector: unshared 'FieldTmp0' (0 uses left)
PMaccVerbose MEMORY(1) | DataConnector: unshared 'J' (0 uses left)
PMaccVerbose MEMORY(1) | DataConnector: unshared 'E' (0 uses left)
PMaccVerbose MEMORY(1) | DataConnector: unshared 'B' (0 uses left)
[kepler020:08170] *** Process received signal ***
[kepler020:08170] Signal: Segmentation fault (11)
[kepler020:08170] Signal code: Address not mapped (1)
[kepler020:08170] Failing at address: 0x35
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 8170 on node kepler020 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

@psychocoderHPC
Member

@ax3l Yes, the issue is triggered by the data connector because of wrong ownership.

The plain pointer of the RNGProvider is deleted here in MySimulation, and later the shared pointer is also freed by the DataConnector.

Solution

We need to hold the RNGProvider in MySimulation as a shared pointer and must trigger the share function from MySimulation, not from the class itself.
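
A minimal ownership sketch of the pattern described above, using plain std::shared_ptr; the DataConnector and share() here are simplified stand-ins, not the actual PMacc API (the dataset id is taken from the verbose log above):

#include <memory>
#include <string>
#include <unordered_map>

// Simplified stand-in for PMacc::DataConnector: keeps shared ownership
// of every registered dataset and releases it during cleanup.
struct DataConnector
{
    void share(const std::string& id, std::shared_ptr<void> data)
    {
        datasets[id] = std::move(data);
    }
    std::unordered_map<std::string, std::shared_ptr<void>> datasets;
};

struct RNGProvider { /* device buffers, RNG states, ... */ };

// Buggy pattern: MySimulation deletes a raw RNGProvider pointer while the
// DataConnector also holds a shared_ptr to the same object -> double free.
// Fixed pattern (sketched below): MySimulation holds a shared_ptr and
// registers it with the DataConnector itself, so there is exactly one
// control block and the object is destroyed exactly once.
struct MySimulation
{
    void init(DataConnector& dc)
    {
        rngProvider = std::make_shared<RNGProvider>();
        dc.share("RNGProvider3XorMin", rngProvider); // shared, not re-owned
    }
    std::shared_ptr<RNGProvider> rngProvider; // no manual delete anywhere
};

int main()
{
    DataConnector dc;
    MySimulation sim;
    sim.init(dc);
    // At shutdown both dc and sim drop their references; the RNGProvider
    // is freed once instead of being deleted twice.
    return 0;
}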

Note: I am currently sick and cannot address this issue within the next two weeks.

@ax3l
Member

ax3l commented Apr 12, 2017

Yes, I just posted the same thing a minute ago above :D Currently testing...

@ax3l assigned ax3l and unassigned PrometheusPi on Apr 12, 2017
ax3l added a commit to ax3l/picongpu that referenced this issue Apr 12, 2017
Our RNGFactory should be shared with the `DataConnector` within
MySimulation and should not share itself in its constructor.
@ax3l removed the component: examples (PIConGPU or PMacc examples) label on Apr 12, 2017
@ax3l changed the title from "Out-of-memory in Laser Wakefield example with ionization" to "Double Free for Ionization (RNG-using) Simulations" on Apr 12, 2017
PrometheusPi added a commit that referenced this issue Apr 12, 2017
Fix #1953 RNG Shutdown via DataConnector
@ax3l
Member

ax3l commented Apr 13, 2017

should be fixed with #1963

@ax3l closed this as completed on Apr 13, 2017