Peptidoform output from Proteome Discoverer #510

Merged
merged 6 commits into from
Dec 19, 2024

mlocardpaulet
Contributor

I added a test file in test/data/dda_quant/ProteomeDIscoverer_PeptideGroups_trimmed.txt. It is trimmed, and has all possible columns that can be exported from this level.

@mlocardpaulet
Contributor Author

mlocardpaulet commented Dec 17, 2024

From what I understand:

  • there is one peptidoform per row
  • stripped sequences are in Sequence
  • modifications are in Modifications
  • all protein accessions matching a given peptidoform, along with the species, are in the column Protein Accessions (this requires a specific FASTA parsing rule within Proteome Discoverer, which I'll need to add to the documentation)
  • multiple accessions are separated with ";"
  • contaminants can be identified by "Cont_" in the column Master Protein Descriptions (which does not contain all the accessions matching the peptide). Actually, this may be an issue. Options: change the parsing rule so that "Cont_" appears in the column Protein Accessions; keep a list of all the contaminant accessions in the FASTA somewhere; or use the column Contaminant, but I would have to double-check how it was set up in the search parameters.
  • for quantification, I would suggest using the values of the columns:
    Abundances Normalized F1 Sample ConditionA, Abundances Normalized F2 Sample ConditionA, Abundances Normalized F3 Sample ConditionA, Abundances Normalized F4 Sample ConditionB, Abundances Normalized F5 Sample ConditionB, Abundances Normalized F6 Sample ConditionB
    The sample names may vary depending on how users set up their analysis. I have to check how to make them consistent...
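
To make the points above concrete, here is a rough sketch of how such a file could be read. It assumes the export is tab-delimited and that quantification headers start with "Abundances" and contain "Normalized"; the function name `read_peptidoforms`, the sample accessions, the modification string, and the values are all hypothetical, and this is not the parser added in this PR.

```python
import csv
import io

# Hypothetical two-row excerpt of a tab-delimited PeptideGroups export.
# Accessions, modification strings, and values are made up for illustration.
SAMPLE = (
    "Sequence\tModifications\tProtein Accessions\t"
    "Abundances Normalized F1 Sample ConditionA\t"
    "Abundances Normalized F4 Sample ConditionB\n"
    "PEPTIDEK\t\tACC1_HUMAN; ACC2_YEAST\t1000.5\t980.0\n"
    "MSEQK\tOxidation\tACC3_ECOLI\t\t42.0\n"
)


def read_peptidoforms(handle):
    """Return (quant_columns, rows) with one dict per peptidoform row."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Assumption: quant columns start with "Abundances" and mention
    # "Normalized"; the exact header text depends on the PD setup.
    quant_cols = [
        c for c in reader.fieldnames
        if c.startswith("Abundances") and "Normalized" in c
    ]
    rows = []
    for rec in reader:
        rows.append({
            "sequence": rec["Sequence"],
            "modifications": rec["Modifications"],
            # Multiple accessions are separated with ";".
            "accessions": [
                a.strip()
                for a in rec["Protein Accessions"].split(";")
                if a.strip()
            ],
            # Empty cells are treated as missing values.
            "intensities": {
                c: float(rec[c]) if rec[c] else None for c in quant_cols
            },
        })
    return quant_cols, rows


quant_cols, rows = read_peptidoforms(io.StringIO(SAMPLE))
```

Matching on a header prefix rather than on full column names is only one option; it avoids hard-coding sample names, which may differ between users.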

So... a few points are not completely clear yet, sorry.

@mlocardpaulet
Contributor Author

mlocardpaulet commented Dec 18, 2024

Regarding the contaminants:
I think we should act as if the "Cont_" prefix were in the column Protein Accessions.
I will change the parsing of the FASTA so that it works.
This means that with the file I have uploaded so far, no contaminants will be removed, but it should work with the next one.

I don't think that we should rely on the column Contaminants. It would require providing a FASTA with only the contaminants, plus some parameters to set up in PD.
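
A minimal sketch of what the "Cont_" rule could look like on the Protein Accessions field. It assumes a row counts as a contaminant if any of its accessions starts with "Cont_" (whether "any" or "all" is the right rule has not been decided here); `split_accessions`, `is_contaminant`, and the example accessions are hypothetical names for illustration.

```python
CONTAMINANT_PREFIX = "Cont_"


def split_accessions(field):
    """Split a ';'-separated Protein Accessions cell into clean accessions."""
    return [acc.strip() for acc in field.split(";") if acc.strip()]


def is_contaminant(field, prefix=CONTAMINANT_PREFIX):
    """True if any accession in the cell carries the contaminant prefix.

    Assumption: one contaminant match is enough to flag the whole row.
    """
    return any(acc.startswith(prefix) for acc in split_accessions(field))


# Hypothetical cells, not taken from the uploaded file.
flags = [
    is_contaminant("ACC1_HUMAN; ACC2_YEAST"),
    is_contaminant("ACC1_HUMAN; Cont_ACC9"),
    is_contaminant(""),
]
```

With the file uploaded so far, this would flag nothing, since the prefix is not yet present in Protein Accessions.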

@mlocardpaulet
Contributor Author

So... here is a new file, where I did not specify any experimental design, so the columns holding the quantities are named:

Abundances (Normalized): F1: Sample
Abundances (Normalized): F2: Sample
Abundances (Normalized): F3: Sample
Abundances (Normalized): F4: Sample
Abundances (Normalized): F5: Sample
Abundances (Normalized): F6: Sample

Sub-optimal. I'll have to find something else. I suspect that they are ordered (A, A, A and B, B, B), but I am not entirely sure.
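
One possible workaround, sketched under assumptions: pull the file ID (F1 to F6) out of each header with a regular expression and attach a condition from a separate mapping. The mapping below encodes the unverified guess that F1 to F3 are condition A and F4 to F6 are condition B; `ASSUMED_CONDITIONS`, `file_id`, and `annotate_headers` are hypothetical names, not part of the module.

```python
import re

# Matches headers of the form "Abundances (Normalized): F1: Sample".
HEADER_RE = re.compile(r"^Abundances \(Normalized\): (F\d+): Sample$")

# Assumption (not verified): files are ordered A, A, A, B, B, B.
ASSUMED_CONDITIONS = {
    "F1": "A", "F2": "A", "F3": "A",
    "F4": "B", "F5": "B", "F6": "B",
}


def file_id(header):
    """Return the file ID (e.g. 'F1') from a quant header, or None."""
    match = HEADER_RE.match(header)
    return match.group(1) if match else None


def annotate_headers(headers, conditions=ASSUMED_CONDITIONS):
    """Map each quant header to a (file_id, condition) pair.

    Headers that do not match the pattern are skipped; file IDs missing
    from the mapping get a condition of None.
    """
    annotated = {}
    for header in headers:
        fid = file_id(header)
        if fid is not None:
            annotated[header] = (fid, conditions.get(fid))
    return annotated


headers = ["Sequence"] + [
    f"Abundances (Normalized): F{i}: Sample" for i in range(1, 7)
]
annotated = annotate_headers(headers)
```

Keeping the condition mapping outside the parser means it can be replaced once the actual file order is confirmed.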

But the issues with the column Protein Accessions are fixed: it now contains all the accessions matching a given peptidoform AND the prefix "Cont_" for contaminants.
MulticonsensusProteoBench_DDAmodule2_SequestHT_Percolator_Quanti_241218_PeptideGroups_trimmed.txt.zip

I will work on the documentation, and try to find a "simple" way to get column headers that make more sense for the quantification...

@RobbinBouwmeester RobbinBouwmeester merged commit 392cb3c into main Dec 19, 2024
8 checks passed
@RobbinBouwmeester RobbinBouwmeester deleted the peptidoform-module-ProteomeDiscoverer branch December 19, 2024 13:04
2 participants