Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update curated papers, add new PDB provider (FURNA). #1193

Merged
merged 27 commits into from
Oct 19, 2024

Conversation

nagutm
Copy link
Collaborator

@nagutm nagutm commented Oct 7, 2024

Updated curated papers list with all papers identified from Aug 9th in #1165.
Curated new provider for PDB Structure called 'furna'. PMID: 39074139

Mufaddal Naguthanawala added 2 commits October 1, 2024 18:48
Copy link

codecov bot commented Oct 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 43.56%. Comparing base (8950e70) to head (ad69445).
Report is 96 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1193      +/-   ##
==========================================
+ Coverage   42.51%   43.56%   +1.05%     
==========================================
  Files         117      118       +1     
  Lines        8327     8176     -151     
  Branches     1963     1343     -620     
==========================================
+ Hits         3540     3562      +22     
+ Misses       4582     4450     -132     
+ Partials      205      164      -41     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@cthoyt cthoyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that PEPhub makes sense as an entry in the Bioregistry, since it itself has pseudo-curies as identifiers. Instead, it makes sense to make providers for geo. It also appears that they prefix GEO entries with other pseudo-prefixes like bedbase, which might be some other wrapper service around GEO.

You can see the example where The Gene Curation Coalition also wraps HGNC in https://bioregistry.io/registry/hgnc

Further, this PR is doing a few different things. Please split it up into multiple PRs that are each about a single thing

@bgyori
Copy link
Contributor

bgyori commented Oct 8, 2024

Thanks @cthoyt, let's discuss options:

I don't think that PEPhub makes sense as an entry in the Bioregistry, since it itself has pseudo-curies as identifiers.

GEO is more of an exception than the general rule here in that most entries are not easily recognizable community resources like GEO (e.g., https://pephub.databio.org/xuebingjie1990/excludranges, etc.). From the perspective of "I want to be able to refer to a PEPhub entry via a CURIE and resolve it on the web", I think it makes sense to curate it as a prefix. If it were only providing GEO-type entries, I agree a provider or even a registry would be a better fit. Still, of course this is not a clean cut case either way.

Further, this PR is doing a few different things. Please split it up into multiple PRs that are each about a single thing

This is something I recommented to @nagutm because the changes here correspond to having reviewed a small batch of papers from the list of predictions for potential relevance. I suggested making PRs with changes on ~10 papers so that the rows added to the curation spreadsheet correspond to changes made in Bioregistry data files in a single PR. I suppose this could be broken up into multiple PRs though where each addition to the curation spreadsheet that results in a specific data update would be its own PR.

@bgyori
Copy link
Contributor

bgyori commented Oct 8, 2024

Actually, this made me realize that PEPHub as a prefix is most similar to GitHub as a prefix: https://bioregistry.io/registry/github, where the pattern for GitHub's IDs (^[a-zA-Z0-9-_]+(/[a-zA-Z0-9-_]+)?$) is very similar to what was curated here (^[A-Za-z0-9_\\-]+(/[A-Za-z0-9_\\-]+)?$).

@nagutm
Copy link
Collaborator Author

nagutm commented Oct 8, 2024

Thank you for the input @cthoyt.

To summarize, the possible courses of action are:

  1. Add PEPhub as a prefix. Although majority of the entries are currently related to GEO, if in the future more entries are made that are independent of that then it makes sense to have a way to resolve other PEPhub entries.
  2. Add PEPhub as a provider for GEO. Besides GEO, the remaining identifiers on PEPhub are not very populated and seem to be quite experimental in nature. Example: https://pephub.databio.org/chlescure/new-project?tag=hello. If this remains the case for the near future then curating it as a provider might make sense for now and we could reassess in the future.

For now I have removed the commits relating to PEPhub for this PR and will make a separate PR referencing this discussion until it is resolved.

@nagutm nagutm changed the title Update curated papers, PEPhub prefix and new PDB provider. Update curated papers, add new PDB provider (FURNA). Oct 8, 2024
@nagutm nagutm mentioned this pull request Oct 8, 2024
@bgyori
Copy link
Contributor

bgyori commented Oct 17, 2024

The Documentation tests fail with

Warning, treated as error:
dot command 'dot' cannot be run (needed for graphviz output), check the graphviz_dot setting
docs-test: exit 2 (15.88 seconds) /home/runner/work/bioregistry/bioregistry/docs> sphinx-build -W -b html -d /home/runner/work/bioregistry/bioregistry/.tox/docs-test/tmp/build/doctrees /home/runner/work/bioregistry/bioregistry/.tox/docs-test/tmp/source /home/runner/work/bioregistry/bioregistry/.tox/docs-test/tmp/build/html pid=2071
  docs-test: FAIL code 2 (24.62=setup[8.73]+cmd[0.00,0.00,15.88] seconds)
  evaluation failed :( (24.70 seconds)
Error: Process completed with exit code 2.

but I'm not sure anything related to Graphviz changed in this PR. Possibly a new dependency got pulled in the build which breaks the documentation generation.

@cthoyt
Copy link
Member

cthoyt commented Oct 17, 2024

@bgyori fix in #1205

@cthoyt
Copy link
Member

cthoyt commented Oct 17, 2024

Before finishing this, can you explain what's the concept between the curated_papers.tsv and actually making curations in the bioregistry.json? Are these supposed to be done concurrently, or it was just a exception that one of the curations in bioregistry.json accompanied literature curations here? Is the plan to track which literature curation goes with each additional provider? I don't see an actual addition of the publication to the PDB publication list in bioregistry.json in this file. Would be important to clarify these things and put them in the markdown file in #1195 (and ideally, merge that either together with this PR or in quick succession)

@bgyori
Copy link
Contributor

bgyori commented Oct 17, 2024

My original plan was to review a batch of (e.g., 10) papers, update curated_papers.tsv with the results of curation, and make any corresponding changes in bioregistry.json, then submit a PR with all of these together. However, one of the first things you asked for is to separate out each bioregistry.json change into a separate PR. @nagutm did this and so this PR just has a single purpose change to bioregistry.json, and other additions were spun out into separate PRs. This makes the scope of this PR a bit strange but this is a one time thing. Going forward, I assume each PR will contain a single purpose change to bioregistry.json (following your suggestion) along with curation(s) in curated_papers.tsv.

@bgyori
Copy link
Contributor

bgyori commented Oct 17, 2024

In terms of

Is the plan to track which literature curation goes with each additional provider? I don't see an actual addition of the publication to the PDB publication list in bioregistry.json in this file.

I totally agree, every curation going into bioregistry.json should obviously refer to the PMID that it was curated based on. It's just that it wasn't clear where to put a citation that is for a provider, not the prefix itself. Would the PMID for Furna go into the publication list for pdb? Or should we extend the provider data model to allow publications tied directly to a provider?

@nagutm
Copy link
Collaborator Author

nagutm commented Oct 17, 2024

With the addition of the pr_added column in the TSV file I think separating individual curations into separate PRs makes the most sense going forward. It will also help to isolate any ambiguous cases like pephub so that the discussion doesn't get lost in a long thread like this one.

Does it make sense to create separate PRs for the irrelevant papers added to the TSV file moving forward or group them in with independent curation PRs?

@cthoyt
Copy link
Member

cthoyt commented Oct 17, 2024

agreed, let's extend the data model for the provider. See #1207

@bgyori
Copy link
Contributor

bgyori commented Oct 18, 2024

@nagutm the data model now allows adding publications to providers so please add Furna's publication, then we can wrap this up.

@cthoyt cthoyt merged commit 6d0902d into biopragmatics:main Oct 19, 2024
13 checks passed
@nagutm nagutm deleted the update_curated_papers branch October 28, 2024 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants