Atom at the origin cause incorrect res_atom_elements results #296

Open
wanggaa opened this issue Sep 26, 2024 · 1 comment
Labels
good first issue Good for newcomers

Comments


wanggaa commented Sep 26, 2024

Hello lucidrains, first of all, thank you very much for the PyTorch reproduction of AlphaFold3.

The problem is in the function extract_canonical_molecules_from_biomolecule_chains in inputs.py.

In some CIF files, the coordinates of certain key ions are set to the origin. Because of this, the line

res_atom_positions = atom_positions[res_ligand_atom_mask]

does not produce the corresponding result correctly, and res_atom_elements ends up empty.

When these values are later passed to create_mol_from_atom_positions_and_types, it raises an exception:

ValueError: The length of atom_elements and xyz_coordinates must be the same.

You can reproduce this problem using the 1qyl_assembly1.cif file as input, which has two vanadium ions that are each at the origin with 25% probability.
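
The mismatch can be illustrated without the library. The following is only a minimal sketch under stated assumptions, not the actual code in inputs.py: the helpers `select_by_mask`, `nonzero_coordinate_mask`, and `check_lengths` are hypothetical, and the assumption is that positions are selected by the atom mask while elements are selected by a "coordinates are non-zero" test.

```python
# Hypothetical stand-ins for the filtering described in this issue.
# None of these helpers exist in the library; they only mimic the behaviour.

def select_by_mask(items, mask):
    """Keep the items whose mask entry is truthy (like boolean indexing)."""
    return [item for item, keep in zip(items, mask) if keep]

def nonzero_coordinate_mask(positions):
    """Treat an atom as present only if at least one coordinate is non-zero."""
    return [any(c != 0.0 for c in xyz) for xyz in positions]

def check_lengths(atom_elements, xyz_coordinates):
    """Mimic the length check that produces the reported error message."""
    if len(atom_elements) != len(xyz_coordinates):
        raise ValueError(
            "The length of atom_elements and xyz_coordinates must be the same."
        )

# A single-atom ion residue that happens to sit exactly at the origin.
atom_positions = [(0.0, 0.0, 0.0)]
atom_elements = ["V"]
atom_mask = [True]  # the atom is present according to the mask

# Assumption: positions follow the atom mask, elements follow the
# coordinate-based test, so the two selections disagree for this ion.
res_atom_positions = select_by_mask(atom_positions, atom_mask)
res_atom_elements = select_by_mask(
    atom_elements, nonzero_coordinate_mask(atom_positions)
)

try:
    check_lengths(res_atom_elements, res_atom_positions)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(res_atom_positions, res_atom_elements, error_message)
```

Under these assumptions the ion keeps its position but loses its element, which is the kind of length mismatch reported above.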

lucidrains added the "good first issue" label Sep 29, 2024
amorehead (Contributor) commented Sep 29, 2024

Hi, @wanggaa. Thanks for your kind words on this project!

My intuition tells me that this issue is caused by said ions having a zero vector for their coordinates, as one can see is possible in the construction of Biomolecule objects here (which are subsequently used to build PDBInputs -> MoleculeInputs -> AtomInputs):

pos[residue_constants.atom_order[atom_name]] = atom.coord

For such ions, the single atom_mask value is 1 even though the coordinates may be all zeros. Subsequently, as you noticed, in extract_canonical_molecules_from_biomolecule_chains we keep only the atom elements of a given molecule whose atoms have non-zero coordinates. This usually produces a mismatch between the number of elements and the number of coordinates, which is caught later on and causes the PDB structure to be "rejected" by our dataloader and replaced with another example for training/validation.
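
One possible direction, offered only as a sketch and not as the project's chosen fix, is to select positions and elements with the same presence mask, so that an atom at the origin is kept as long as its mask entry is set. The function name `extract_residue_atoms` and its signature are hypothetical.

```python
# Hypothetical helper: filter positions and elements with ONE shared mask,
# rather than inferring presence from whether the coordinates are non-zero.

def extract_residue_atoms(atom_positions, atom_elements, atom_mask):
    """Return (positions, elements) for atoms whose mask entry is truthy."""
    if not (len(atom_positions) == len(atom_elements) == len(atom_mask)):
        raise ValueError("positions, elements and mask must have equal length")

    kept = [
        (xyz, element)
        for xyz, element, present in zip(atom_positions, atom_elements, atom_mask)
        if present
    ]
    positions = [xyz for xyz, _ in kept]
    elements = [element for _, element in kept]
    return positions, elements

# Illustrative data: a present ion at the origin, a present atom elsewhere,
# and a genuinely missing atom whose placeholder coordinates are zeros.
positions, elements = extract_residue_atoms(
    atom_positions=[(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (0.0, 0.0, 0.0)],
    atom_elements=["V", "O", "H"],
    atom_mask=[True, True, False],
)

print(positions, elements)
```

Because both lists come from the same mask, their lengths always agree. Whether the atom mask is reliable enough for every PDB entry is an assumption that would need checking against real data.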

I've noticed this occurrence for other PDB IDs, and in such cases, it's an open question how best to handle such PDB structures. Ideas or pull requests for better ways to handle such edge cases are very much welcome.

Best,
Alex
