-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhanced Sequencing Run Logging #343
Conversation
Changed base to |
Added more information to the end of run report, namely the number of skipped spectra. Unfortunately the model never sees the spectra that are skipped due to pre-check errors such as having invalid precursor info, so these aren't reflected in the end of run report. They do however appear at other points in the log. A sample log is available below:
|
Can you describe a bit what the |
For sure,
|
Imo this is a bit over-engineered. Instead, we can just log directly where specific information is available using the Python loggers, rather than having to pass it all upstream to these new interfaces. Your implementation is precisely with the Python logger does after all: having a universal interface that multiple writers can hook into and write to different outputs. So we don't need to reinvent that. So I'd just log the PSM statistics in the |
Oh ok sounds good - so just to be clear should all of the PSM processing (e.g. the score distribution calculation) and logging happen in the |
Yes, I think that's the logical place, considering that all PSMs are aggregated there. Of course the over-engineering is my personal feeling. We could ask @wfondrie to weigh in as well. Logging the number of skipped spectra can indeed be delayed for now. |
I agree that this solution is overengineered for addressing just the post run logging. My reasoning for implementing it this way is having a PSM io interface would make it relatively easy to extend the PSM io in general. For example if we want to support other output formats like parquet or csv in the future and add a command line option to specify multiple output formats (say you want to write the PSMs to csv and mzTab) it would be relatively easy to do this, or if an end user wants to write PSMs to an external component like a database they could write their own PredictionWriter module to handle this. |
Fair enough, I'll get everything moved to the mzTab writer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of these comments are very small things that you can quickly fix.
One bigger comment is that on reflection the output writer is also not the best place to do this logging. Instead, a better location seems casanovo.py
, where the sequencing is actually executed. The PSM statistics can still be retrieved from the MztabWritter through the ModelRunner.
Additionally, we could still try to generalize it a bit. The runtime, hostname, GPU memory consumption, etc. are equally relevant to a training run. So maybe some general logging for any Casanovo execution, with then a small adaptation for the sequencing run to include PSM statistics there as well.
Implemented additional logging functionality in order to provide an enhanced end of sequence run log. This takes the form of an end of sequencing run report, an example is shown below:
All the data to generate this report is recorded on the fly as inference is conducted (i.e. no file parsing). To facilitate this the functionality of the data submodule was extended to add infastructure for run time logging. This will hopefully make it relatively easy to further extend logging and other io functionality in the future.