HelixFold inference fails on the Ultra-Long Monomer Protein Demo #294

Open
zepingWww opened this issue Jun 5, 2024 · 0 comments

zepingWww commented Jun 5, 2024

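I want to run HelixFold inference with the latest paddlepaddle-gpu and adapted my setup following README_inference.md. ppfleetx was upgraded and adapted on top of Paddle 2.6.1; all other steps are the same as in the documentation.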

Environment

  • python: 3.8
  • PaddlePaddle: 2.6.1
  • cuda: 12.0
  • cudnn: 8.9.1
  • GPU configuration: A800 80G * 8

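Problem

While adapting the HelixFold for Ultra-Long Monomer Protein Demo, inference crashes as soon as it reaches the self.evoformer layer, without printing any error message.

Is there a layer in Paddle 2.6 that serves the same purpose and could replace DistEmbeddingsAndEvoformer, or can this layer be modified so that inference runs normally?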
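Since the crash at self.evoformer prints nothing, one way to at least locate the failing Python call is the standard-library faulthandler module, which dumps the Python stack when the process receives a fatal signal such as SIGSEGV or SIGABRT. This is only a minimal sketch: the helper name enable_crash_traceback is made up for illustration and is not part of HelixFold or Paddle.

```python
import faulthandler
import sys


def enable_crash_traceback(stream=sys.stderr):
    """Dump the Python stack of all threads to `stream` on a fatal signal.

    Hypothetical helper for illustration only; `stream` must be a real file
    object (it needs a file descriptor) and must stay open while enabled.
    """
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


# Assumed usage: call once at the very top of the inference entry point,
# before the model is built, e.g.
#     enable_crash_traceback()
```

If the process dies inside native code, the dump should show which Python line was executing at that moment, which may help confirm whether the crash really happens inside DistEmbeddingsAndEvoformer.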

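  • The same problem also occurs on the CASP14 Demo, but there it can be bypassed by replacing DistEmbeddingsAndEvoformer with EmbeddingsAndEvoformer. Here, after the same replacement, the following error is reported: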
terminate called after throwing an instance of 'phi::enforce::EnforceNotMet'
  what():  

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   add_ad_func(paddle::Tensor const&, paddle::Tensor const&)
1   add_ad_func(paddle::Tensor const&, paddle::Tensor const&)
2   paddle::experimental::add(paddle::Tensor const&, paddle::Tensor const&)
3   void phi::AddRawKernel<phi::dtype::bfloat16, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, int, phi::DenseTensor*)
4   phi::dtype::bfloat16* phi::DeviceContext::Alloc<phi::dtype::bfloat16>(phi::TensorBase*, unsigned long, bool) const
5   phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool)
6   paddle::memory::allocation::Allocator::Allocate(unsigned long)
7   paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
8   paddle::memory::allocation::RetryAllocator::AllocateImpl(unsigned long)
9   paddle::memory::allocation::StreamSafeCUDAAllocator::AllocateImpl(unsigned long)
10  paddle::memory::allocation::AutoGrowthBestFitAllocator::AllocateImpl(unsigned long)
11  paddle::memory::allocation::AutoGrowthBestFitAllocator::FreeIdleChunks()
12  paddle::memory::allocation::CUDAAllocator::FreeImpl(phi::Allocation*)
13  phi::enforce::EnforceNotMet::EnforceNotMet(common::ErrorSummary const&, char const*, int)
14  phi::enforce::GetCurrentTraceBackString[abi:cxx11](bool)

----------------------
Error Message Summary:
----------------------
ExternalError: CUDA error(700), an illegal memory access was encountered. 
  [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ../paddle/fluid/platform/device/gpu/gpu_info.cc:269)
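CUDA reports kernel errors asynchronously, so the add kernel in the traceback above may only be the place where the illegal access was detected (the error surfaces while the allocator frees idle chunks), not the kernel that caused it. A minimal sketch for rerunning inference with synchronous kernel launches is shown below. CUDA_LAUNCH_BLOCKING is a standard CUDA environment variable; the helper names blocking_env and run_sync, and the script name in the usage comment, are placeholders and not part of HelixFold.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_sync(cmd):
    """Run `cmd` with CUDA_LAUNCH_BLOCKING=1; return (exit code, combined output)."""
    proc = subprocess.run(
        cmd,
        env=blocking_env(),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so the C++ traceback is captured too
        text=True,
    )
    return proc.returncode, proc.stdout


# Assumed usage; "run_inference.py" is a hypothetical script name:
#     code, log = run_sync([sys.executable, "run_inference.py"])
```

With synchronous launches, the C++ traceback should point at the kernel that actually performed the invalid access, at the cost of slower execution.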