Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to get original SEQRES? #1949

Closed
lokapal opened this issue Sep 9, 2024 · 9 comments · May be fixed by #1950
Closed

How to get original SEQRES? #1949

lokapal opened this issue Sep 9, 2024 · 9 comments · May be fixed by #1950

Comments

@lokapal
Copy link

lokapal commented Sep 9, 2024

I need in SEQRES in the original non-translated to single-letter code, because I need to find non-standard aminoacids that are presenting in some proteins (more frequently they are hydroxyproline and hydroxylysine). I can detect them by scanning .resname from chemical and search for them into original SEQRES. Right now it doesn't work obviously.

from prody import *

chems = parsePDBHeader('1mkp', 'chemicals')
for chem in chems:
    print(chem.resname)

polys = parsePDBHeader('1mkp', 'polymers')
for prot in polys:
    SEQRES=prot.sequence
    print (SEQRES)

I'd like to get in SEQRES the list of strings or at least the one big string with three-letters codes and spaces. Are there means to do that?
Moreover, your forced conversion makes errors:
1BKV:

SEQRES   1 A   30  PRO HYP GLY PRO HYP GLY PRO HYP GLY ILE THR GLY ALA
SEQRES   2 A   30  ARG GLY LEU ALA GLY PRO HYP GLY PRO HYP GLY PRO HYP
SEQRES   3 A   30  GLY PRO HYP GLY
SEQRES   1 B   30  PRO HYP GLY PRO HYP GLY PRO HYP GLY ILE THR GLY ALA
SEQRES   2 B   30  ARG GLY LEU ALA GLY PRO HYP GLY PRO HYP GLY PRO HYP
SEQRES   3 B   30  GLY PRO HYP GLY
SEQRES   1 C   30  PRO HYP GLY PRO HYP GLY PRO HYP GLY ILE THR GLY ALA
SEQRES   2 C   30  ARG GLY LEU ALA GLY PRO HYP GLY PRO HYP GLY PRO HYP

prot.sequence:

PPGPPGPPGITGARGLAGPPGPPGPPGPPG

We know from MODRES, that oxyprolines are at the following places:

MODRES 1BKV HYP A    2  PRO  4-HYDROXYPROLINE
MODRES 1BKV HYP A    5  PRO  4-HYDROXYPROLINE
MODRES 1BKV HYP A    8  PRO  4-HYDROXYPROLINE
MODRES 1BKV HYP A   20  PRO  4-HYDROXYPROLINE
MODRES 1BKV HYP A   23  PRO  4-HYDROXYPROLINE
MODRES 1BKV HYP A   26  PRO  4-HYDROXYPROLINE
MODRES 1BKV HYP A   29  PRO  4-HYDROXYPROLINE
MODRES 1BKV HYP B   32  PRO  4-HYDROXYPROLINE
MODRES 1BKV HYP B   35  PRO  4-HYDROXYPROLINE
MODRES 1BKV HYP B   38  PRO  4-HYDROXYPROLINE
MODRES 1BKV HYP B   50  PRO  4-HYDROXYPROLINE
MODRES 1BKV HYP B   53  PRO  4-HYDROXYPROLINE
MODRES 1BKV HYP B   56  PRO  4-HYDROXYPROLINE
MODRES 1BKV HYP B   59  PRO  4-HYDROXYPROLINE
MODRES 1BKV HYP C   62  PRO  4-HYDROXYPROLINE
MODRES 1BKV HYP C   65  PRO  4-HYDROXYPROLINE
MODRES 1BKV HYP C   68  PRO  4-HYDROXYPROLINE
MODRES 1BKV HYP C   80  PRO  4-HYDROXYPROLINE
MODRES 1BKV HYP C   83  PRO  4-HYDROXYPROLINE
MODRES 1BKV HYP C   86  PRO  4-HYDROXYPROLINE
MODRES 1BKV HYP C   89  PRO  4-HYDROXYPROLINE..

ProDy replaces them ALL by normal Proline, that is error obviously

@karolamik13
Copy link
Contributor

Hello,
It will not solve the problem with SEQRES, but a few days ago, I wrote a function in ProDy that might be useful for you in detecting non-standard residues (see below). I can add an additional option to display residue numbers as well. It will be available soon in ProDy in the GitHub version once it is accepted by my colleagues (#1948).

In [19]: p = parsePDB('1BKV')
@> PDB file is found in working directory (1bkv.pdb.gz).
@> 692 atoms and 1 coordinate set(s) were parsed in 0.02s.

In [20]: checkNonstandardResidues(p.select('name CA'))
@> There are several non-standard residues in the structure.
@> Replace the non-standard name in the PDB file with the equivalent name from the standard one if you want to include them in the interactions.
@> Residues: HYP HYP HYP HYP HYP HYP HYP HYP HYP HYP HYP HYP HYP HYP HYP HYP HYP HYP HYP HYP HYP.

@karolamik13
Copy link
Contributor

Actually, I did it already. It is useful to have resid as well.

In [1]: from prody import *

In [2]: p = parsePDB('1BKV')
@> PDB file is found in working directory (1bkv.pdb.gz).
@> 692 atoms and 1 coordinate set(s) were parsed in 0.01s.

In [3]: checkNonstandardResidues(p.select('name CA'))
@> There are several non-standard residues in the structure.
@> Replace the non-standard name in the PDB file with the equivalent name from the standard one if you want to include them in the interactions.
@> Residues: HYP2 HYP5 HYP8 HYP20 HYP23 HYP26 HYP29 HYP32 HYP35 HYP38 HYP50 HYP53 HYP56 HYP59 HYP62 HYP65 HYP68 HYP80 HYP83 HYP86 HYP89.

@jamesmkrieger
Copy link
Contributor

I think we probably should be able to parse the seqres and return the 3 character version, but .getSequence is probably not the answer.

In the 1 letter version, we do alias them as the nearest standard amino acid and also there’s a small annoyance that you get the sequence repeated across for every atom unless you use a Chain object or select ca atoms.

We’ll have a look but think this shouldn’t be too hard

@jamesmkrieger
Copy link
Contributor

For what it's worth, we do have this:

In [11]: polys = parsePDBHeader('1bkv', 'polymers')
@> PDB file is found in working directory (1bkv.pdb).

In [12]: polys[0].modified
Out[12]: 
[('HYP', 'A', '2', 'PRO', '4-HYDROXYPROLINE'),
 ('HYP', 'A', '5', 'PRO', '4-HYDROXYPROLINE'),
 ('HYP', 'A', '8', 'PRO', '4-HYDROXYPROLINE'),
 ('HYP', 'A', '20', 'PRO', '4-HYDROXYPROLINE'),
 ('HYP', 'A', '23', 'PRO', '4-HYDROXYPROLINE'),
 ('HYP', 'A', '26', 'PRO', '4-HYDROXYPROLINE'),
 ('HYP', 'A', '29', 'PRO', '4-HYDROXYPROLINE')]

@jamesmkrieger
Copy link
Contributor

ok, I have a solution at a new pull request that I'm about to make. It will give you a new keyword argument threeLetter than you can use to get your desired behaviour:

In [1]: from prody import *

In [2]: polys = parsePDBHeader('1bkv', 'polymers', threeLetter=True)
@> PDB file is found in working directory (1bkv.pdb).

In [3]: for prot in polys:
   ...:     SEQRES=prot.sequence
   ...:     print (SEQRES)
   ...: 
PRO HYP GLY PRO HYP GLY PRO HYP GLY ILE THR GLY ALA ARG GLY LEU ALA GLY PRO HYP GLY PRO HYP GLY PRO HYP GLY PRO HYP GLY
PRO HYP GLY PRO HYP GLY PRO HYP GLY ILE THR GLY ALA ARG GLY LEU ALA GLY PRO HYP GLY PRO HYP GLY PRO HYP GLY PRO HYP GLY
PRO HYP GLY PRO HYP GLY PRO HYP GLY ILE THR GLY ALA ARG GLY LEU ALA GLY PRO HYP GLY PRO HYP GLY PRO HYP GLY PRO HYP GLY

In [4]: ag = parsePDB('1bkv', compressed=False)
@> PDB file is found in working directory (1bkv.pdb).
@> 692 atoms and 1 coordinate set(s) were parsed in 0.01s.

In [5]: for chain in ag.iterChains():
   ...:     print(chain.getSequence(threeLetter=True))
   ...: 
GLY PRO GLY PRO GLY ILE THR GLY ALA ARG GLY LEU ALA GLY PRO GLY PRO GLY PRO GLY PRO GLY
PRO GLY PRO GLY PRO GLY ILE THR GLY ALA ARG GLY LEU ALA GLY PRO GLY PRO GLY PRO GLY PRO GLY
PRO GLY PRO GLY PRO GLY ILE THR GLY ALA ARG GLY LEU ALA GLY PRO GLY PRO GLY PRO GLY PRO GLY

In [6]: ag.ca.getSequence(threeLetter=True)
Out[6]: 'GLY PRO GLY PRO GLY ILE THR GLY ALA ARG GLY LEU ALA GLY PRO GLY PRO GLY PRO GLY PRO GLY PRO GLY PRO GLY PRO GLY ILE THR GLY ALA ARG GLY LEU ALA GLY PRO GLY PRO GLY PRO GLY PRO GLY PRO GLY PRO GLY PRO GLY ILE THR GLY ALA ARG GLY LEU ALA GLY PRO GLY PRO GLY PRO GLY PRO GLY'

@jamesmkrieger
Copy link
Contributor

ok, I think this is now essentially ready. Please feel free to look over the pull request and let us know if we can add anything else and to check out the branch jamesmkrieger:seqres_threeLetter and try it.

@jamesmkrieger
Copy link
Contributor

@lokapal please let us know if these comments and changes meet your requirements

@lokapal
Copy link
Author

lokapal commented Oct 15, 2024

All is fixed right now, as it seems to me. Thank you very much!

@lokapal lokapal closed this as completed Oct 15, 2024
@jamesmkrieger
Copy link
Contributor

You’re welcome

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants