Cuda error 77, please suggest how to debug #68

sergei-mironov · 2020-08-24T14:52:24Z

Hi. I applied TASO (commit dce8c4d) to several models from onnx-models. One frequent error I got is:

Cuda failure: 77
/workspace/taso/src/cudnn/element_kernel.cu:242
Aborting...

Affected models are: inception-v2-9, mnist-8, resnet101-v2-7, resnet18-v2-7, roberta-base-11, shufflenet-9, vgg19-7, yolov4.
Since even the trivial MNIST model is in the list, I suspect the problem is caused by an environment issue, such as a package version mismatch or similar.

The error message is not very verbose, and the CUDA line it names doesn't look suspicious. I would be glad to provide more debugging information, but unfortunately I'm not an expert in low-level CUDA. Could you please suggest what I can do to collect more information?
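One generic way to collect more information is sketched below. In CUDA runtimes before 10.1, code 77 corresponds to `cudaErrorIllegalAddress` (an illegal memory access). Kernel launches are asynchronous, so the line printed by the error check may not be the kernel that actually faulted. The sketch re-runs a reproducer with `CUDA_LAUNCH_BLOCKING=1`, which makes launches synchronous. When one is on PATH, it also wraps the run in `compute-sanitizer` (CUDA 11+) or `cuda-memcheck` (older toolkits) to report the offending access. The script name `run_taso.py` and the helper names are hypothetical. Assume a script that loads and optimizes one affected model, for example mnist-8.

```python
import os
import shutil
import subprocess
import sys


def build_debug_command(script, use_memcheck=True):
    """Return (argv, env) that re-runs `script` with CUDA debugging aids enabled."""
    env = dict(os.environ)  # copy, so the caller's environment is untouched
    # Make kernel launches synchronous: the failing launch then reports its own
    # error instead of a later, unrelated CUDA call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    argv = [sys.executable, script]
    if use_memcheck:
        # compute-sanitizer ships with CUDA 11+, cuda-memcheck with older toolkits;
        # prepend whichever is on PATH so the faulting access gets reported.
        for tool in ("compute-sanitizer", "cuda-memcheck"):
            if shutil.which(tool):
                argv = [tool] + argv
                break
    return argv, env


def run_debug(script):
    """Run the reproducer and return its exit code (tool reports go to stdout/stderr)."""
    argv, env = build_debug_command(script)
    return subprocess.run(argv, env=env).returncode
```

Calling `run_debug("run_taso.py")` should then point at the kernel and address that trigger the failure. Expect the memory checker to slow execution considerably.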


Alex-Sol · 2020-10-20T13:42:12Z

You could change ele->use_kernel() to return true (here for OP_EW_ADD), so that the cuDNN element-wise kernel is used. This avoids the bug:

bool Element::use_kernel(void) const {
    switch (type) {
        case OP_EW_ADD:
            return true;
        case OP_EW_MUL:
        case OP_EW_MAX:
        case OP_EW_MIN:
            break;
        default:
            return false;
    }
......

jiahuiyang · 2020-11-18T08:15:29Z

Hi, @Alex-Sol ,
I ran into the same problem as @grwlf. After changing the Element::use_kernel function, I hit the following error with resnet50:

/home/TASO/src/cudnn/cuda_helper.cu:83: void helperSetBroadcastableTensorDescriptor(const taso::Tensor&, const taso::Tensor&, cudnnTensorDescriptor_t): Assertion `input.default_layout()' failed.
Aborted (core dumped)

Could you help me solve this problem?

jiahuiyang · 2020-11-19T02:32:16Z

I found that the problem is related to the Gemm operator. If I comment out some code in __init__.py as follows, the layout problem goes away. But I still need to know how to fix it properly. @Alex-Sol

def _gemm(op, graph, tensors, initializer):
    inputs = _get_inputs(op, graph, tensors, initializer)
    attrs = _parse_attribute(op.attribute)
    if "transA" in attrs and attrs["transA"] == 1:
        inputs[0] = graph.transpose(inputs[0], (1,0), shuffle=True)
    if "transB" in attrs and attrs["transB"] == 1:
        inputs[1] = graph.transpose(inputs[1], (1,0), shuffle=True)
    outputs = graph.matmul(inputs[0], inputs[1])
    # if len(inputs) > 2:
    #     outputs = graph.add(outputs, inputs[2])
    return outputs
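The shape-only sketch below suggests why the bias add is the trigger. It makes no TASO calls, and the helper names are hypothetical. For a 2-D Gemm, op(A) @ op(B) has shape (M, N). The bias C of a fully connected layer is typically a 1-D tensor of shape (N,), so the two operands of `graph.add` differ in rank. The example dimensions resemble a ResNet-50 classifier head and are an assumption.

```python
def gemm_output_shape(a_shape, b_shape, trans_a=0, trans_b=0):
    """Shape of op(A) @ op(B) for a 2-D Gemm, honouring transA/transB."""
    m, k = (a_shape[1], a_shape[0]) if trans_a else a_shape
    k2, n = (b_shape[1], b_shape[0]) if trans_b else b_shape
    if k != k2:
        raise ValueError(f"inner dims differ: {k} vs {k2}")
    return (m, n)


def ranks_match(x_shape, y_shape):
    """True if an element-wise add would see operands of equal rank."""
    return len(x_shape) == len(y_shape)


# Assumed classifier head: input (1, 2048), weights stored as (1000, 2048), transB=1.
out = gemm_output_shape((1, 2048), (1000, 2048), trans_b=1)
bias = (1000,)
print(out, bias, ranks_match(out, bias))  # (1, 1000) (1000,) False
```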

Alex-Sol · 2020-11-24T10:59:02Z

@jiahuiyang This may be a bug caused by a mismatch between the dims of the bias and the matmul output.
I fixed it like this in python/taso/__init__.py:

if len(inputs) > 2:
    dim = inputs[2].dim(0)
    reshape_bias = graph.reshape(inputs[2], (1,dim))
    outputs = graph.add(outputs, reshape_bias)
return outputs
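To check that reshaping the bias to (1, dim) preserves Gemm semantics, here is a pure-Python reference. It is a sketch, not TASO code, and all helper names are hypothetical. Alpha and beta are ignored, as in the `_gemm` importer quoted in this thread. Broadcasting the (1, N) bias over the rows of the (M, N) product gives the same result as adding the original 1-D bias to every row.

```python
def transpose(x):
    return [list(col) for col in zip(*x)]


def matmul(a, b):
    # (M, K) @ (K, N) -> (M, N)
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_row_bias(x, bias_2d):
    """Element-wise add of an (M, N) matrix and a (1, N) bias, broadcast over rows."""
    assert len(bias_2d) == 1 and len(bias_2d[0]) == len(x[0]), "bias must be (1, N)"
    return [[v + b for v, b in zip(row, bias_2d[0])] for row in x]


def gemm(a, b, c=None, trans_a=0, trans_b=0):
    """Reference Y = op(A) @ op(B) + C with alpha = beta = 1."""
    if trans_a:
        a = transpose(a)
    if trans_b:
        b = transpose(b)
    y = matmul(a, b)
    if c is not None:
        # The fix: view the 1-D bias of length N as a (1, N) tensor before adding.
        y = add_row_bias(y, [list(c)])
    return y


a = [[1, 2], [3, 4]]          # (2, 2)
w = [[1, 0], [0, 1], [1, 1]]  # (3, 2), used with transB=1 -> (2, 3)
bias = [10, 20, 30]           # (3,)
print(gemm(a, w, bias, trans_b=1))  # [[11, 22, 33], [13, 24, 37]]
```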

jiahuiyang · 2020-11-28T02:17:47Z

Great, thanks.