-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inconsistent metrics result between eval and denovo mode, and possible reason #122
Comments
Hi, Thanks for pointing out the pytorch lightening bug relating to metric accumulation over steps, looks like that's reason behind discrepancy in results, and also the duplicate stop token checking - this shouldn't be happening in the current implementation but we still added a check with the recent PR #123 to catch these. Also sorry for the delay but you can find the Casanovo peptide/AA precision values at coverage=1 for the updated 9-species benchmark below (corresponding to Figure S2 and S5 from preprint):
|
Thank you very much for the values! For preprint Figure S2 and S5 only:
Thanks! |
|
A similar situation has been discover in train mode, the "AA precision AA recall Peptide recall" of validation dataset displayed on console is also much higher than the actual accuracy, which might also be caused by pytorch-lightning codes in model.py. I would also suggest not use these values, I have compare them with manually calculated AA precision and peptide recall values from output mztab files, these displayed values are consistently higher. They can only be used as an indicator of when to stop training but not accurate performance of benchmark dataset. |
Thanks for pointing this out - we're also relying on manual calculation of precision values when reporting results |
Notes from our discussion just now: One hypothesis (from Wout) is that this may be due to a mismatch in the batch size, and it throws the calculation off. Another is that it has to do with how Pytorch Lightning aggregates results at the level of epoch. But if that's the case, then others would have run into this as well. Googling did not turn anything up. |
Thanks for letting me know! |
I am reopening this because we are still hoping to track down this discrepancy. It might not be immediate, though. |
I wrote a script to calculate the AA and Peptide precision of the Casanovo PSMs, and I am seeing a minor discrepancy between the output of my script and Casanovo's evaluate mode. I ran the script and evaluate mode on the sample data, which has no skipped spectra during sequencing, so the error is unlikely to be originating from that. From what I can tell the discrepancy is either from an error in my script or we're still seeing batch effects in eval mode. Here's the output from my script and eval mode:
Also, here is the test script: import depthcharge.masses
import casanovo
import depthcharge
import pyteomics.mztab
import casanovo.denovo
import casanovo.denovo.evaluate
def process_mgf_file(mgf_path):
with open(mgf_path) as f:
return [
line[4:].strip() for line in f if line.startswith("SEQ=")
]
def main():
mz_tab = pyteomics.mztab.MzTab("foo.mzTab")
output = mz_tab.spectrum_match_table["sequence"].tolist()
ground_truth = process_mgf_file("sample_data\sample_preprocessed_spectra.mgf")
peptide_masses = depthcharge.masses.PeptideMass()
aa_precision, _, pep_precision = casanovo.denovo.evaluate.aa_match_metrics(
*casanovo.denovo.evaluate.aa_match_batch(
ground_truth,
output,
peptide_masses.masses
)
)
print(f"AA Precision: {aa_precision}")
print(f"Pep Precision: {pep_precision}")
if __name__ == "__main__":
main() |
Your script is correct, so I think it's still an issue with aggregating the results across multiple batches. Maybe this can be relatively easily checked by running evaluation with a single batch only (by expanding the batch size and/or reducing the number of spectra). Although, come to think of it, the batch size for inference should standard be 1024, so maybe that's already the case. Could you double-check this? |
I haven't looked at the code, but doesn't it seem a bit crazy to try to compute these metrics in a batchwise fashion? Why not just aggregate a score vector and a label vector and then do a single call to the metric calculation at the end? |
Now it's calculated at the end of each validation batch, where the loss is also calculated. Otherwise you have to keep all of the predictions and true labels in memory across the whole run, which can easily be on the order of millions. So in terms of code it's not really crazy but actually a natural way to do it. But maybe leading to slightly incorrect results then. |
I checked the default config and batch size for inference does indeed default to 1024. As for the batch effects, I may be misunderstanding the amino acid and peptide precision scores but wouldn't be possible to keep a count of the number of correct amino acids and peptides, as well as the total number of amino acids and peptides, and use this values to do the precision scores calculation at the end of an evaluation run? |
We should first verify what the actual problem is. This batched data aggregation was only my hypothesis, we haven't isolated the issue yet. You could debug the code I referred to a bit more to see what the values there are and try to find the point where they deviate. |
After some further investigation it turns out that this issue persists even in the case where there's only one validation step. I'm still trying to track down the exact discrepancy though. |
This is a follow-up of issue #111, thanks to all the information provided, I am able to find a possible bug and its reason.
I did experiment with weights "pretrained_excl_yeast.ckpt", deepnovo original raw yeast dataset, and Casanovo version 3.1.0.
In eval mode (results displayed on console):
aa_precision: 0.7057459339992785
aa_recall: 0.7071716086435299
pep_recall: 0.5046985050978404
In denovo mode (results manually calculated from output mztab file):
aa_precision: 0.6827751198617065
aa_recall: 0.6843725335842246
pep_recall: 0.5046985051027288
There is difference between results from eval and denovo mode (aa_precision, aa_recall), so I studies model.py and its pytorch lightning implementation.
I first excluded the possibility that this may be caused by predictions without '$' or with multiple '$$' (it turns out there are no such predictions), because this line of code removes the first position of prediction without checking whether it is '$', I am referring to v3.2.0, the code in v3.1.0 is similar.
Then I checked validation part of the code, and find the inconsistence might be caused by the way pytorch lightning accumulates metrics from each step and calculates per-epoch metrics. So I manually accumulates the prediction data from each step and calculate metrics for one epoch, and find it is the same as the results from denovo mode, which verifies my guess. For manual calculation, I use methods from evaluate.py.
Manual calculation from validation_step() output:
aa_precision: 0.6827751198617065
aa_recall: 0.6843725335842246
pep_recall: 0.5046985051027288
In order to further verify, I tried with deepnovo raw ricebean dataset, and see the same pattern.
In eval mode (results displayed on console):
aa_precision: 0.5835929214378351
aa_recall: 0.583695772587527
pep_recall: 0.4223428193208169
In denovo mode (results manually calculated from output mztab file):
aa_precision: 0.576437557764845
aa_recall: 0.5758105380102586
pep_recall: 0.4223428193248386
Manual calculation from validation_step() output:
aa_precision: 0.576437557764845
aa_recall: 0.5758105380102586
pep_recall: 0.4223428193248386
So the conclusion is, pytorch lightning can accumulate pep_recall from each step and calculate it correctly for each epoch, but it may fail the same job for aa_precision and aa_recall.
Ps. I again would like to request numerical values of Casanovo's AA/peptide precision/recall for updated 9-species benchmark (Figure S5 from preprint). Thanks a lot.
The text was updated successfully, but these errors were encountered: