Error generating rxn_maps due to mismatch in array size #56

Closed
starkAhmed43 opened this issue Sep 10, 2024 · 6 comments

Comments

@starkAhmed43

Hello,

I am trying to generate atom mappings for some 30,000 reaction SMILES built from reactions in the Rhea database. I download the mol files of the participating metabolites from ChEBI, generate SMILES using RDKit, and then concatenate them using . and >> to form reaction SMILES.
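The concatenation step described above can be sketched as follows. This is a minimal illustration in plain Python, assuming the per-molecule SMILES have already been produced; the helper name `build_reaction_smiles` is hypothetical and not part of any library.

```python
from typing import Sequence


def build_reaction_smiles(reactants: Sequence[str], products: Sequence[str]) -> str:
    """Join per-molecule SMILES into the 'A.B>>C.D' reaction SMILES form."""
    if not reactants or not products:
        raise ValueError("need at least one reactant and one product")
    for smi in (*reactants, *products):
        # A single-molecule SMILES should not be empty, contain '>' or whitespace.
        if not smi or ">" in smi or any(ch.isspace() for ch in smi):
            raise ValueError(f"unexpected molecule SMILES: {smi!r}")
    return ".".join(reactants) + ">>" + ".".join(products)


rxn = build_reaction_smiles(
    ["*N[C@@H](CCC(N)=O)C(=O)[O-]", "O"],
    ["*N[C@@H](CCC(=O)[O-])C(=O)[O-]", "[NH4+]"],
)
print(rxn)
```

The sanity checks are deliberately light; they only catch inputs that would corrupt the `.` and `>>` separators.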

Using BatchedMapper, I am able to successfully generate atom maps for 24,600 of them. For the remaining 5,000 or so, when I use RXNMapper to identify the cause of the error, I get a few different error types:

Error 1: index X is out of bounds for axis 0 with size X
Error 2: could not broadcast input array from shape (A,) into shape (B,)
Error 3: The size of tensor a (D) must match the size of tensor b (512) at non-singleton dimension 1

Error 2 accounts for the vast majority of unsuccessful maps.
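A tally like this could be produced by mapping the failed reactions one at a time and grouping the exceptions by message. The sketch below is an assumption about how one might do it, not the code actually used; `triage_failures`, `normalize_error`, and the `map_one` callable are hypothetical names, and `map_one` stands for whatever single-reaction mapping call is in use.

```python
import re
from collections import Counter
from typing import Callable, Iterable


def normalize_error(message: str) -> str:
    """Replace every integer with 'N' so messages that differ only in sizes group together."""
    return re.sub(r"\d+", "N", message)


def triage_failures(reactions: Iterable[str], map_one: Callable[[str], object]) -> Counter:
    """Call map_one on each reaction and count failures by normalized error message."""
    counts: Counter = Counter()
    for rxn in reactions:
        try:
            map_one(rxn)
        except Exception as exc:  # broad on purpose: the goal is to catalogue failures
            counts[f"{type(exc).__name__}: {normalize_error(str(exc))}"] += 1
    return counts
```

With rxnmapper installed, `map_one` might be something like `lambda rxn: mapper.get_attention_guided_atom_maps([rxn])` for an `RXNMapper` instance named `mapper`; `counts.most_common()` then lists the error types by frequency.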

I come from a data science background and have zero chemistry knowledge, so while I understand the error messages, I do not understand what is causing them.

It would be really great if the authors of RXNMapper could help me out here. Please let me know if you need more details regarding the errors I am getting.

@avaucher
Member

Hi,
I agree that the error messages are sometimes cryptic. Is it possible for you to share a few of these reactions?

My best guess is that the reaction SMILES strings are too long for the current model, but happy to check directly.
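A rough way to test the length hypothesis is to count SMILES tokens before mapping. The sketch below uses a regular expression commonly used for SMILES tokenization; that it matches the model's tokenizer exactly is an assumption, as are the 512 limit (taken from the third error message above) and the two positions reserved for special tokens.

```python
import re

# Widely used SMILES tokenization pattern (assumed to approximate the model's tokenizer).
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
)

MAX_TOKENS = 512  # assumption: the size reported in the tensor-mismatch error


def count_tokens(rxn_smiles: str) -> int:
    """Approximate number of tokens in a (reaction) SMILES string."""
    return len(SMILES_TOKEN_PATTERN.findall(rxn_smiles))


def is_probably_too_long(rxn_smiles: str, max_tokens: int = MAX_TOKENS, reserved: int = 2) -> bool:
    """True if the token count plus `reserved` special tokens exceeds `max_tokens`."""
    return count_tokens(rxn_smiles) + reserved > max_tokens


print(count_tokens("CCO>>CC=O"))
```

Because the count is only an estimate, reactions close to the limit are best checked by actually running them through the mapper.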

@starkAhmed43
Author

starkAhmed43 commented Sep 18, 2024

Hi. Thank you so much for helping me out with this issue. Here are some SMILES examples for the errors I've mentioned above:

Error 1: index X is out of bounds for axis 0 with size X

[1*][C@@H]1O[C@H](CO)[C@@H](OP(=O)([O-])[O-])[C@H]1O.O>>*[C@@H]1O[C@H](CO)[C@@H](O)[C@H]1O.O=P([O-])([O-])O

[1*]C(=O)C(=O)[O-].[H+].NC(=O)C1=CN([C@@H]2O[C@H](COP(=O)([O-])OP(=O)([O-])OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)C=CC1.[NH4+]>>*[C@H]([NH3+])C(=O)[O-].O.NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)([O-])OP(=O)([O-])OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)c1

[1*]C([2*])=O.[H+].NC(=O)C1=CN([C@@H]2O[C@H](COP(=O)([O-])OP(=O)([O-])OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)C=CC1>>*C(*)O.NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)([O-])OP(=O)([O-])OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)c1

[NH3+][C@H]1[C@@H](O[C@H]2[C@H](O)[C@H](O)[C@H](O)[C@@H](O)[C@@H]2O)O[C@H](CO)[C@@H](O)[C@@H]1O.[1*]C(=O)OC[C@H](COP(=O)([O-])[O-])OC([2*])=O.[H+]>>*C(=O)OC[C@H](COP(=O)([O-])O[C@@H]1[C@H](O)[C@H](O)[C@@H](O)[C@H](O)[C@H]1O[C@H]1O[C@H](CO)[C@@H](O)[C@H](O)[C@H]1[NH3+])OC(*)=O.O

Error 2: could not broadcast input array from shape (A,) into shape (B,)

*N[C@@H](CCC(N)=O)C(=O)[O-].O>>*N[C@@H](CCC(=O)[O-])C(=O)[O-].[NH4+]

*[C@H]1C[C@H](OP(=O)([O-])[O-])[C@@H](CO)O1.O>>*[C@H]1C[C@H](O)[C@@H](CO)O1.O=P([O-])([O-])O

*N[C@@H](COP(=O)([O-])OCC(C)(C)[C@@H](O)C(=O)NCCC(=O)NCCSC(*)=O)C(*)=O.C[S+](CC[C@H]([NH3+])C(=O)[O-])C[C@H]1O[C@@H](n2cnc3c(N)ncnc32)[C@H](O)[C@@H]1O>>*C(=O)N[C@H]1CCOC1=O.[H+].*N[C@@H](COP(=O)([O-])OCC(C)(C)[C@@H](O)C(=O)NCCC(=O)NCCS)C(*)=O.CSC[C@H]1O[C@@H](n2cnc3c(N)ncnc32)[C@H](O)[C@@H]1O

*C(=O)OC1CCC2(C)C(CCC3C4CCC(*)C4(C)CCC32)C1.O>>*C(=O)[O-].*C1CCC2C3CCC4CC(O)CCC4(C)C3CCC12C.[H+]

Error 3: The size of tensor a (D) must match the size of tensor b (512) at non-singleton dimension 1

CC(=O)N[C@H]1[C@@H](OP(=O)([O-])OP(=O)([O-])OC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(\\C)CC/C=C(\\C)CCC=C(C)C)O[C@H](CO)[C@@H](O[C@@H]2O[C@H](CO)[C@@H](OP(=O)([O-])OC[C@H](O)COP(=O)([O-])OC[C@@H](O)[C@@H](O)[C@@H](O)COP(=O)([O-])OC[C@@H](O)[C@@H](O)[C@@H](O)CO)[C@H](O)[C@H]2NC(C)=O)[C@@H]1O.O=c1ccn([C@@H]2O[C@H](COP(=O)([O-])OP(=O)([O-])O[C@H]3O[C@H](CO)[C@@H](O)[C@H](O)[C@H]3O)[C@@H](O)[C@H]2O)c(=O)[nH]1>>CC(=O)N[C@H]1[C@@H](OP(=O)([O-])OP(=O)([O-])OC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(\\C)CC/C=C(\\C)CCC=C(C)C)O[C@H](CO)[C@@H](O[C@@H]2O[C@H](CO)[C@@H](OP(=O)([O-])OC[C@H](O)COP(=O)([O-])OC[C@@H](O)[C@@H](O)[C@@H](O)COP(=O)([O-])OC[C@@H](O[C@@H]3O[C@H](CO)[C@@H](O)[C@H](O)[C@H]3O)[C@@H](O)[C@@H](O)CO)[C@H](O)[C@H]2NC(C)=O)[C@@H]1O.[H+].O=c1ccn([C@@H]2O[C@H](COP(=O)([O-])OP(=O)([O-])[O-])[C@@H](O)[C@H]2O)c(=O)[nH]1

CC(C)(COP(=O)([O-])OP(=O)([O-])OC[C@H]1O[C@@H](n2cnc3c(N)ncnc32)[C@H](O)[C@@H]1OP(=O)([O-])[O-])[C@@H](O)C(=O)NCCC(=O)NCCSC(=O)C=Cc1ccc(O)cc1.[H+].CC(C)(COP(=O)([O-])OP(=O)([O-])OC[C@H]1O[C@@H](n2cnc3c(N)ncnc32)[C@H](O)[C@@H]1OP(=O)([O-])[O-])[C@@H](O)C(=O)NCCC(=O)NCCSC(=O)CC(=O)[O-].NC(=O)C1=CN([C@@H]2O[C@H](COP(=O)([O-])OP(=O)([O-])OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](OP(=O)([O-])[O-])[C@@H]3O)[C@@H](O)[C@H]2O)C=CC1>>O=C=O.CC(C)(COP(=O)([O-])OP(=O)([O-])OC[C@H]1O[C@@H](n2cnc3c(N)ncnc32)[C@H](O)[C@@H]1OP(=O)([O-])[O-])[C@@H](O)C(=O)NCCC(=O)NCCS.O.O=C(/C=C/c1ccc(O)cc1)c1ccc(O)cc1[O-].NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)([O-])OP(=O)([O-])OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](OP(=O)([O-])[O-])[C@@H]3O)[C@@H](O)[C@H]2O)c1

Nc1ncnc2c1ncn2[C@@H]1O[C@H](COP(=O)([O-])OP(=O)([O-])OP(=O)([O-])[O-])[C@@H](O)[C@H]1O.CC1=C2[N-][C@H]([C@H](CC(=O)[O-])[C@@]2(C)CCC(=O)[O-])[C@]2(C)N=C(C(C)=C3N=C(C=C4N=C1[C@@H](CCC(=O)[O-])C4(C)C)[C@@H](CCC(=O)[O-])[C@]3(C)CC(N)=O)[C@@H](CCC(=O)[O-])[C@]2(C)CC(N)=O.[Co+2].Cc1cc2c(cc1C)N(C[C@H](O)[C@H](O)[C@H](O)COP(=O)([O-])OP(=O)([O-])OC[C@H]1O[C@@H](n3cnc4c(N)ncnc43)[C@H](O)[C@@H]1O)c1[nH]c(=O)[nH]c(=O)c1N2>>C/C1=C2/[N-][C@H]([C@H](CC(=O)[O-])[C@@]2(C)CCC(=O)[O-])[C@]2(C)N=C(/C(C)=C3\\N=C(/C=C4\\N=C1[C@@H](CCC(=O)[O-])C4(C)C)[C@@H](CCC(=O)[O-])[C@]3(C)CC([NH-])=O)[C@@H](CCC(=O)[O-])[C@]2(C)CC(N)=O.C[C@H]1O[C@@H](n2cnc3c(N)ncnc32)[C@H](O)[C@@H]1O.[Co+3].[H+].Cc1cc2nc3c(=O)[n-]c(=O)nc-3n(C[C@H](O)[C@H](O)[C@H](O)COP(=O)([O-])OP(=O)([O-])OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)c2cc1C.O=P([O-])([O-])OP(=O)([O-])OP(=O)([O-])[O-]

CC(=O)N[C@@H]1[C@H](O[C@H]2[C@H](O)[C@@H](NC(C)=O)[C@@H](OP(=O)([O-])OP(=O)([O-])OC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(\\C)CC/C=C(\\C)CCC=C(C)C)O[C@@H]2CO)O[C@H](CO)[C@@H](OP(=O)([O-])OC[C@H](O)COP(=O)([O-])OC[C@H](O)CO)[C@@H]1O.Nc1ccn([C@@H]2O[C@H](COP(=O)([O-])OP(=O)([O-])OC[C@@H](O)[C@@H](O)[C@@H](O)CO)[C@@H](O)[C@H]2O)c(=O)n1>>CC(=O)N[C@H]1[C@@H](OP(=O)([O-])OP(=O)([O-])OC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(\\C)CC/C=C(\\C)CCC=C(C)C)O[C@H](CO)[C@@H](O[C@@H]2O[C@H](CO)[C@@H](OP(=O)([O-])OC[C@H](O)COP(=O)([O-])OC[C@H](O)COP(=O)([O-])OC[C@@H](O)[C@@H](O)[C@@H](O)CO)[C@H](O)[C@H]2NC(C)=O)[C@@H]1O.Nc1ccn([C@@H]2O[C@H](COP(=O)([O-])[O-])[C@@H](O)[C@H]2O)c(=O)n1.[H+]

@avaucher
Member

Thanks for the examples!

I started looking into it and will shortly have a fix for Errors 1 and 2. These were caused by asterisks in the reaction SMILES, which sometimes stand for atom placeholders. After the fix, these reactions should succeed.
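For anyone on an older version who wants to see which reactions are affected, the placeholders can be located with a small regular expression. This is an illustrative sketch only: `WILDCARD` and `wildcards_per_side` are hypothetical names, and the pattern assumes placeholders appear either as a bare `*` or inside brackets such as `[1*]`, as in the examples above.

```python
import re
from typing import List, Tuple

# Bracketed placeholder with optional leading label (e.g. "[1*]") or a bare "*".
WILDCARD = re.compile(r"\[\d*\*[^\]]*\]|\*")


def wildcards_per_side(rxn_smiles: str) -> Tuple[List[str], List[str]]:
    """Return the placeholder tokens found before and after '>>'."""
    precursors, _, products = rxn_smiles.partition(">>")
    return WILDCARD.findall(precursors), WILDCARD.findall(products)


def has_wildcard(rxn_smiles: str) -> bool:
    """True if the reaction SMILES contains at least one placeholder atom."""
    return WILDCARD.search(rxn_smiles) is not None


left, right = wildcards_per_side("[1*]C([2*])=O.[H+]>>*C(*)O")
print(left, right)
```

In the Error 1 examples the placeholders are labeled on the precursor side (`[1*]`, `[2*]`) but bare on the product side (`*`), which this per-side view makes easy to spot.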

Error 3 seems to be caused by reaction SMILES that are too long. I'll improve the error message for these.

@avaucher
Member

Errors 1 and 2 fixed by #57.

@avaucher
Member

Error message for 3 improved in #58.

@avaucher
Member

The new version on PyPI (0.4.0) should cover all the above! Feel free to reopen this issue if needed.
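To confirm that an environment has picked up the release, the installed version can be compared against 0.4.0. A minimal sketch using only the standard library; the helper names are hypothetical, and the version parsing is deliberately simplistic (it looks only at the first three numeric components).

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def release_tuple(version_string: str) -> Tuple[int, ...]:
    """First three numeric components, zero-padded, e.g. '0.4' -> (0, 4, 0)."""
    nums = [int(n) for n in re.findall(r"\d+", version_string)][:3]
    return tuple(nums + [0] * (3 - len(nums)))


def has_fix(installed: str, minimum: str = "0.4.0") -> bool:
    """True if `installed` is at least `minimum`."""
    return release_tuple(installed) >= release_tuple(minimum)


def installed_rxnmapper_version() -> Optional[str]:
    """Installed rxnmapper version, or None if the package is not installed."""
    try:
        return version("rxnmapper")
    except PackageNotFoundError:
        return None


found = installed_rxnmapper_version()
print("rxnmapper not installed" if found is None else f"{found}: fix included = {has_fix(found)}")
```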
