-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Download weight file from URL #349
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## dev #349 +/- ##
==========================================
+ Coverage 89.98% 94.03% +4.04%
==========================================
Files 12 12
Lines 979 1022 +43
==========================================
+ Hits 881 961 +80
+ Misses 98 61 -37 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few small comments.
Two key things are still missing as well:
- Documentation of this functionality on readthedocs (maybe on the file formats page).
- Unit tests for the download behavior.
You could also check the "last-modified" information for the URL and verify whether this is more recent than the download time or not. |
I adjusted the hashing implementation such that now files downloaded from a user specified URL will now be downloaded to |
In regards to the readthedocs documentation it looks like weight files aren't mentioned at all in the "Input Files" section on the file formats page. Is this something we want to add? |
Yes, this new (and also old) behavior of weights file downloading should still be added to the documentation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's starting to look quite nice for a complex piece of code. A few more comments.
casanovo/casanovo.py
Outdated
file_response = requests.head(file_url) | ||
if file_response.ok: | ||
if "Last-Modified" in file_response.headers: | ||
url_last_modified = email.utils.parsedate_to_datetime( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: I'm reluctant to include a library with rather different functionality for just a utility function (even though it's part of the standard libraries). Do we need extra functionality that strptime
doesn't offer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, I changed it to use strptime instead.
casanovo/casanovo.py
Outdated
import logging | ||
import urllib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: I don't think this is necessary (rather the relevant import is urllib.parse
a few lines further down).
casanovo/casanovo.py
Outdated
if _is_valid_url(model): | ||
model = _get_weights_from_url(model, cache_dir) | ||
elif not Path(model).is_file(): | ||
raise ValueError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: Add a similar logging statement.
docs/file_formats.md
Outdated
@@ -11,6 +13,14 @@ When you're ready to use Casanovo for *de novo* peptide sequencing, you can inpu | |||
All three of the above file formats can be used as input to Casanovo for *de novo* peptide sequencing. | |||
As the official PSI standard format containing the complete information from a mass spectrometry run, mzML should typically be preferred. | |||
|
|||
### Model Weights | |||
|
|||
In addition to MS/MS spectra, Casanova also optionally accepts a model weights (.ckpt) input file when running in training, sequencing or evaluating mode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: "Casanovo"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice implementation and great work on the unit tests. 👏
In order to address #108, the
--model
parameter now supports URLs. If the--model
parameter is a valid URL the Casanovo CLI will first look in the cache for the weight file (cache resolution is done via hashing the URL), and if the weight file isn't cached the CLI will attempt to download the weight file and cache it. The CLI will raise an error if the--model
parameter is set but it's not a valid URL or file in the local filesystem.However this implementation does have an issue where if the weight file at a URL is updated, but the CLI has previously downloaded weights from that URL, the new weights won't be downloaded because there will still be a cache hit. To fix this I could introduce a
--force
option that skips the cache and always attempts to download and cache weights given a URL.