-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update curated papers, add new PDB provider (FURNA). #1193
Conversation
…ate new prefix for PEPhub. Curate new provider for PDB
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1193 +/- ##
==========================================
+ Coverage 42.51% 43.56% +1.05%
==========================================
Files 117 118 +1
Lines 8327 8176 -151
Branches 1963 1343 -620
==========================================
+ Hits 3540 3562 +22
+ Misses 4582 4450 -132
+ Partials 205 164 -41 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that PEPhub makes sense as an entry in the Bioregistry, since it itself has pseudo-curies as identifiers. Instead, it makes sense to make providers for geo
. It also appears that they prefix GEO entries with other pseudo-prefixes like bedbase
, which might be some other wrapper service around GEO.
You can see the example where The Gene Curation Coalition also wraps HGNC in https://bioregistry.io/registry/hgnc
Further, this PR is doing a few different things. Please split it up into multiple PRs that are each about a single thing
Thanks @cthoyt, let's discuss options:
GEO is more of an exception than the general rule here in that most entries are not easily recognizable community resources like GEO (e.g., https://pephub.databio.org/xuebingjie1990/excludranges, etc.). From the perspective of "I want to be able to refer to a PEPhub entry via a CURIE and resolve it on the web", I think it makes sense to curate it as a prefix. If it were only providing GEO-type entries, I agree a provider or even a registry would be a better fit. Still, of course this is not a clean cut case either way.
This is something I recommented to @nagutm because the changes here correspond to having reviewed a small batch of papers from the list of predictions for potential relevance. I suggested making PRs with changes on ~10 papers so that the rows added to the curation spreadsheet correspond to changes made in Bioregistry data files in a single PR. I suppose this could be broken up into multiple PRs though where each addition to the curation spreadsheet that results in a specific data update would be its own PR. |
Actually, this made me realize that PEPHub as a prefix is most similar to GitHub as a prefix: https://bioregistry.io/registry/github, where the pattern for GitHub's IDs ( |
Thank you for the input @cthoyt. To summarize, the possible courses of action are:
For now I have removed the commits relating to PEPhub for this PR and will make a separate PR referencing this discussion until it is resolved. |
The Documentation tests fail with
but I'm not sure anything related to Graphviz changed in this PR. Possibly a new dependency got pulled in the build which breaks the documentation generation. |
Before finishing this, can you explain what's the concept between the curated_papers.tsv and actually making curations in the bioregistry.json? Are these supposed to be done concurrently, or it was just a exception that one of the curations in bioregistry.json accompanied literature curations here? Is the plan to track which literature curation goes with each additional provider? I don't see an actual addition of the publication to the PDB publication list in bioregistry.json in this file. Would be important to clarify these things and put them in the markdown file in #1195 (and ideally, merge that either together with this PR or in quick succession) |
My original plan was to review a batch of (e.g., 10) papers, update curated_papers.tsv with the results of curation, and make any corresponding changes in bioregistry.json, then submit a PR with all of these together. However, one of the first things you asked for is to separate out each bioregistry.json change into a separate PR. @nagutm did this and so this PR just has a single purpose change to bioregistry.json, and other additions were spun out into separate PRs. This makes the scope of this PR a bit strange but this is a one time thing. Going forward, I assume each PR will contain a single purpose change to bioregistry.json (following your suggestion) along with curation(s) in curated_papers.tsv. |
In terms of
I totally agree, every curation going into bioregistry.json should obviously refer to the PMID that it was curated based on. It's just that it wasn't clear where to put a citation that is for a provider, not the prefix itself. Would the PMID for Furna go into the publication list for |
With the addition of the pr_added column in the TSV file I think separating individual curations into separate PRs makes the most sense going forward. It will also help to isolate any ambiguous cases like pephub so that the discussion doesn't get lost in a long thread like this one. Does it make sense to create separate PRs for the irrelevant papers added to the TSV file moving forward or group them in with independent curation PRs? |
agreed, let's extend the data model for the provider. See #1207 |
@nagutm the data model now allows adding publications to providers so please add Furna's publication, then we can wrap this up. |
Updated curated papers list with all papers identified from Aug 9th in #1165.
Curated new provider for PDB Structure called 'furna'. PMID: 39074139