HelixFold inference fails on the Ultra-Long Monomer Protein Demo #294

zepingWww · 2024-06-05T07:27:06Z

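I am trying to run HelixFold inference with the latest paddlepaddle-gpu. I followed README_inference.md for the setup; the only deviation is that ppfleetx was upgraded and adapted to work on top of paddle 2.6.1. All other steps match the documentation.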

Environment

python: 3.8
PaddlePaddle: 2.6.1
cuda: 12.0
cudnn: 8.9.1
GPU: A800 80G * 8

Problem

When adapting the HelixFold for Ultra-Long Monomer Protein Demo, inference crashes as soon as it reaches the self.evoformer layer, without printing any error message.
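For a crash that leaves no message, one option is Python's standard faulthandler module, which dumps the Python stack of every thread when the process receives a fatal signal such as SIGSEGV or SIGABRT. The following is a minimal sketch, not part of HelixFold: the helper name enable_crash_traces and the log path are hypothetical, and it assumes the crash is delivered as one of the signals faulthandler handles.

```python
import faulthandler
import sys


def enable_crash_traces(log_path=None):
    """Dump Python stacks on fatal signals (SIGSEGV, SIGABRT, ...).

    Hypothetical helper, not a HelixFold or Paddle API.
    Returns the stream being written to; the caller must keep it open
    for as long as faulthandler is enabled.
    """
    stream = open(log_path, "w") if log_path else sys.stderr
    faulthandler.enable(file=stream, all_threads=True)
    return stream
```

Calling enable_crash_traces("crash.log") at the top of the inference entry script, before the model is built, should at least show which Python line was executing when the process died.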

Is there a layer in paddle 2.6 that serves the same purpose and could replace DistEmbeddingsAndEvoformer, or can this layer be modified so that inference runs normally?

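The same problem also occurs on the CASP14 Demo, but there it can be worked around by replacing DistEmbeddingsAndEvoformer with EmbeddingsAndEvoformer. Here, after making that replacement, the following error is reported: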

terminate called after throwing an instance of 'phi::enforce::EnforceNotMet'
  what():  

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   add_ad_func(paddle::Tensor const&, paddle::Tensor const&)
1   add_ad_func(paddle::Tensor const&, paddle::Tensor const&)
2   paddle::experimental::add(paddle::Tensor const&, paddle::Tensor const&)
3   void phi::AddRawKernel<phi::dtype::bfloat16, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, int, phi::DenseTensor*)
4   phi::dtype::bfloat16* phi::DeviceContext::Alloc<phi::dtype::bfloat16>(phi::TensorBase*, unsigned long, bool) const
5   phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool)
6   paddle::memory::allocation::Allocator::Allocate(unsigned long)
7   paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
8   paddle::memory::allocation::RetryAllocator::AllocateImpl(unsigned long)
9   paddle::memory::allocation::StreamSafeCUDAAllocator::AllocateImpl(unsigned long)
10  paddle::memory::allocation::AutoGrowthBestFitAllocator::AllocateImpl(unsigned long)
11  paddle::memory::allocation::AutoGrowthBestFitAllocator::FreeIdleChunks()
12  paddle::memory::allocation::CUDAAllocator::FreeImpl(phi::Allocation*)
13  phi::enforce::EnforceNotMet::EnforceNotMet(common::ErrorSummary const&, char const*, int)
14  phi::enforce::GetCurrentTraceBackString[abi:cxx11](bool)

----------------------
Error Message Summary:
----------------------
ExternalError: CUDA error(700), an illegal memory access was encountered. 
  [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ../paddle/fluid/platform/device/gpu/gpu_info.cc:269)
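Because CUDA reports kernel errors asynchronously, the frame where error 700 surfaces (here the allocator's FreeImpl) is not necessarily the kernel that performed the illegal access. One way to narrow it down is to rerun with the CUDA environment variable CUDA_LAUNCH_BLOCKING=1, which makes kernel launches synchronous so the error is raised closer to the operation that caused it. The sketch below is an assumption-laden helper, not part of HelixFold: run_with_blocking_launches is a hypothetical name, and the inference command is whatever README_inference.md prescribes and is not reproduced here.

```python
import os
import subprocess
import sys


def run_with_blocking_launches(cmd):
    """Run `cmd` (a list of arguments) with CUDA_LAUNCH_BLOCKING=1.

    Hypothetical helper for debugging only: synchronous kernel launches
    slow inference down considerably.
    """
    env = dict(os.environ)              # copy, so the parent environment is untouched
    env["CUDA_LAUNCH_BLOCKING"] = "1"   # CUDA runtime setting, read at process start
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Usage (placeholder command; substitute the real inference invocation):
#   result = run_with_blocking_launches([sys.executable, "your_inference_script.py"])
#   print(result.stderr)
```

The variable has to be set before the process initializes CUDA, which is why the sketch launches a fresh subprocess instead of setting it inside an already running interpreter.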
