About static link to cudnn and cublas #5
Comments
Hello gangmul12! Are you still working on pytorch-gpgpusim?
Hi! I worked on pytorch-gpgpusim, but I've realized that many kernels in the cuDNN library are implemented only in SASS, so now I'm actually studying SASS XD
@gangmul12
@ohcurrent
To add to this, the maxwell_something_xxxx function headers are not in the newer cuDNN versions, so it might have been a mistake that they were there in the first place.
@cng123, then did you simulate with a version higher than cuDNN 7.1.4?
@ohcurrent, I saw that many cuDNN kernels are optimized for a specific architecture; newer versions have volta_xxxx kernels, which contain only SASS code.
@gangmul12
@gangmul12 I was trying to build your project with CUDA 8.0 and cuDNN 7.1.3, since versions above these don't work according to the developer of GPGPU-Sim (gpgpu-sim/gpgpu-sim_distribution#166 (comment)). After installing, I attempted to run an MNIST example (https://github.com/pytorch/examples/blob/master/mnist/main.py), and got the following output. I would be grateful for some help, whether it's a solution or some hint to narrow the problem down.
@Azuresonance
I think nobody is managing this repo now, but for future reference:
My environment:
OS: Ubuntu 16.04
cuDNN: 7.1.4
CUDA: 8.0
I installed PyTorch according to the instructions of @cng123; the link is below:
https://docs.google.com/document/d/17fSM2vrWodP8rWR7ctpgaggVXEw0uD2VCAh0Gi4Gpb4/edit?usp=sharing
I downloaded https://github.com/pytorch/examples and ran the ImageNet benchmark. Then I found that gpgpu-sim often hits a deadlock, a segfault, or CUDNN_STATUS_INTERNAL_ERROR.
I analyzed this problem using LD_DEBUG and found that the PyTorch library dynamically loads libcuda.so.1, which should not be linked.
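As a rough sketch of this kind of check (the script name main.py is just a placeholder for whatever PyTorch workload you run under the simulator):

```bash
# Trace which shared libraries the dynamic loader pulls in at startup.
# LD_DEBUG output goes to stderr, so redirect it to a log file.
LD_DEBUG=libs python main.py 2> ld_debug.log

# If libcuda.so.1 appears here, the real driver library is being
# loaded instead of GPGPU-Sim's libcudart.so.
grep -n "libcuda.so.1" ld_debug.log
```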
I found two reasons why it tried to load libcuda.so.1 instead of GPGPU-Sim's libcudart.so.
First, in the compilation stage for _nvrtc.so, there is a link flag to libcuda.so (pytorch-gpgpu-sim/setup.py, line 1010 in 1e37e08).
I can bypass this issue by using the static cuDNN library, or by deleting the -lcuda link flag (if you want to use the shared version of cuDNN). I personally prefer the static library.
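A quick way to confirm the fix took effect (just a sketch; the built extension's exact path depends on your build tree, so locate it first):

```bash
# Find the built _nvrtc extension and inspect its dynamic dependencies.
NVRTC_SO=$(find . -name "_nvrtc*.so" | head -n 1)

# Expect GPGPU-Sim's libcudart.so here and no libcuda.so.1.
ldd "$NVRTC_SO" | grep -i cuda
```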
Second, libcublas.so has a link to libcuda.so. Strangely, I can't find an explicit linkage from libcublas.so to libcuda.so when I check with the ldd command, but the LD_DEBUG output shows that libcublas.so calls functions in libcuda.so.
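The discrepancy looks roughly like this (the CUDA install path and the script name are assumptions; adjust them to your setup):

```bash
# ldd shows no direct dependency of libcublas.so on libcuda.so.1...
ldd /usr/local/cuda-8.0/lib64/libcublas.so | grep libcuda

# ...but tracing symbol bindings at run time shows calls being
# resolved into libcuda.so.1 while the workload runs.
LD_DEBUG=bindings python main.py 2>&1 | grep "libcuda.so.1"
```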
At first, I tried to resolve this issue by making a copy of libcudart.so under the name libcuda.so.1. However, there are so many unimplemented CUDA driver functions in cuda_runtime_api.cc that it produced CUDNN_STATUS_INTERNAL_ERROR very quickly.
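The failed workaround was roughly the following (paths are examples; GPGPUSIM_ROOT and GPGPUSIM_CONFIG are the variables set by GPGPU-Sim's setup_environment script):

```bash
# Expose a copy of GPGPU-Sim's libcudart.so under the name
# libcuda.so.1 so the loader resolves it instead of the real driver.
cp "$GPGPUSIM_ROOT/lib/$GPGPUSIM_CONFIG/libcudart.so" ./libcuda.so.1
export LD_LIBRARY_PATH=$PWD:$LD_LIBRARY_PATH

# This breaks down quickly because most driver API entry points in
# cuda_runtime_api.cc are not implemented.
```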
Then I just built PyTorch with libcublas_static.a by modifying some CMake values, in pytorch-gpgpu-sim/CMakeLists.txt (line 80 in 1e37e08) and pytorch-gpgpu-sim/cmake/public/cuda.cmake (lines 248 to 251 in 1e37e08).
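Roughly, the change amounts to pointing the cuBLAS library variable at the static archive instead of libcublas.so, whether you edit cuda.cmake directly or pass the value on the command line. A sketch of the idea only (the variable name, CUDA path, and the extra libculibos.a that static cuBLAS needs are assumptions about your setup):

```bash
# Hand CMake the static cuBLAS archive instead of the shared
# libcublas.so; the static archive also needs libculibos.a.
CUDA_LIB=/usr/local/cuda-8.0/lib64
cmake .. \
  -DCUDA_CUBLAS_LIBRARIES="$CUDA_LIB/libcublas_static.a;$CUDA_LIB/libculibos.a"
```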
After that, many errors were gone.
I also think this is closely related to the merged pull request gpgpu-sim/gpgpu-sim_distribution#129.
I'm not sure it is meaningful to improve an old version of PyTorch (v0.4), but anyway, I hope this issue helps your simulation.
Thank you