Migrate OpenData datasets to Zenodo #16
Comments
Hi Jen @jhammock, |
Ah yes, we'll need to decide what to do about those. I expect the files used in resource connectors should go into the new docker container. I'll review the "old resources"; possibly those can go into Zenodo as well, but I'll check them individually. |
@jhammock @KatjaSchulz
In the actual OpenData resource record, the URL is now transformed in this format (shorter):
Nonetheless, both URL formats are accessible. I needed this done before I migrate anything to Zenodo. |
Thanks for the update, Eli! I'll appreciate not having that to worry about until we're migrated :) |
Wonderful! Thanks Eli.
…On Sun, Aug 4, 2024 at 7:49 PM Jen Hammock ***@***.***> wrote:
Thanks for the update, Eli! I'll appreciate not having that to worry about
until we're migrated :)
|
Hi Jen, @jhammock
Thanks. |
@jhammock @KatjaSchulz You can also search by Subject: More search tips here. |
@jhammock @KatjaSchulz |
@jhammock @KatjaSchulz |
I think that status ought to suit most if not all such cases. @KatjaSchulz, we should both check, I suppose. If there's something we don't want to even announce that we have, we can move it offline for now. |
1st private record (restricted) e.g. WoRMS internal: World Register of Marine Species |
Status:
|
No concerns about the test dataset. It may not be the one currently in use, and we can always make up another. |
@eliagbayani I'm trying to orient myself to the zenodo interface. Can you explain this to me? https://zenodo.org/records/13253933/files/13253933.dat?download=1 It's listed under "Files" at https://zenodo.org/records/13253933 |
@jhammock |
So the plan is for the intended files to replace the temp file ultimately, wherever it appears? Is manual editing needed? |
Yes, this one needs manual editing. |
@jhammock, here is the New Version that you initiated but did not complete. |
I can't remember starting that process so I discarded it. Just checking:
|
@jhammock case 2 - Or we provide just the URL e.g. https://eol.org/data/full_provider_ids.csv.gz as metadata in Zenodo record. |
OK, I can see advantages to both cases, but if zenodo policy permits, I think I crave the redundancy of them hosting a copy of all files we list there. We'd presumably also have one of everything, eventually in your new docker instance, @eliagbayani . @KatjaSchulz do you concur? |
Yes I vote for redundancy as well. Thanks. |
Okay, I am getting familiar with zenodo metadata edits. I gather a new version of a resource is only required when the files associated with the record are changed. I have created v2 of the identifier map. I have also messed with some of the metadata, in several subsequent edits, and learned that this can be done while preserving the same version-specific doi. Yay! @KatjaSchulz you should definitely review this one because I named you as the creator. You may prefer to name an institution, which is an option, or to name several creators. I am implicated also for the moment, in the contributor category, as a "contact person". We should probably hash out a policy about this kind of metadata in the zenodo context; the aggregate datasets will probably be case by case, but for the resource files we should be able to do something consistent- or a few different consistent things over different kinds of resources. |
Thanks Eli, this will be very useful.
…On Tue, Sep 17, 2024 at 11:52 AM Eli Agbayani ***@***.***> wrote:
@jhammock <https://github.com/jhammock> @KatjaSchulz
<https://github.com/KatjaSchulz>
Tip: If you know the complete title of your record in Zenodo.
And you try to search it. Paste this in the search textbox:
title:("Your Complete Title")
You can also search by Subject:
subject("EOL Content Partners: Water Body Checklists")
More search tips here. <https://help.zenodo.org/guides/search/>
|
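The title search from the tip above can also be run against Zenodo's records search endpoint. Below is a minimal sketch that only builds the request URL; the helper name `title_search_url` is hypothetical, and it assumes the title itself contains no double quotes.

```python
from urllib.parse import urlencode

# Public records search endpoint of Zenodo's REST API.
ZENODO_RECORDS_API = "https://zenodo.org/api/records"


def title_search_url(title: str) -> str:
    """Build a search URL using the title:("...") syntax from the tip."""
    query = f'title:("{title}")'
    # urlencode percent-encodes the colon, parentheses and quotes.
    return f"{ZENODO_RECORDS_API}?{urlencode({'q': query})}"


if __name__ == "__main__":
    print(title_search_url("EOL full taxon identifier map"))
```

Fetching the printed URL should return the matching records as JSON; restricted records will not appear without an access token.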
@jhammock @KatjaSchulz Where its latest version is now: Jen, Or do you want these records handled manually by you and Katja? |
Thanks, Eli!
Give us a moment to go through this list; at a glance a couple of these may
just be odd ducks to be archived, or otherwise treated differently. I
expect most of them will want that script, on a regular schedule.
More soon!
Jen
…On Wed, Sep 18, 2024 at 11:03 AM Eli Agbayani ***@***.***> wrote:
@jhammock <https://github.com/jhammock> @KatjaSchulz
<https://github.com/KatjaSchulz>
Attached is a list of records where files are saved elsewhere (n=56).
If I'm not mistaken, each should have a .dat file as its uploaded file.
Except for one:
[title] => identifier map: current version
[URL] => https://eol.org/data/full_provider_ids.csv.gz
[Zenodo] => https://zenodo.org/records/13253933
Where its latest version is now:
EOL full taxon identifier map
https://zenodo.org/records/13751009
Jen,
Question: do you want me to proceed and create/run a script that will
check whether the URLs are valid and upload the actual file to its respective Zenodo
record?
Of course a new version of the record will be created (Version 2) to have
the uploaded file.
If the URL is already broken then I don't change anything.
Or do you want these records handled manually by you and Katja?
Thanks.
FilesSavedElsewhere.txt
<https://github.com/user-attachments/files/17046039/FilesSavedElsewhere.txt>
|
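The proposal above (check each URL, upload only when it still resolves, otherwise change nothing) could be sketched roughly as follows. This is not Eli's script: the record keys `zenodo_id` and `url` and both function names are assumptions made for illustration, and the actual Zenodo step (creating Version 2 and uploading the file) is deliberately left out.

```python
import urllib.error
import urllib.request


def url_is_reachable(url, opener=urllib.request.urlopen, timeout=30):
    """Return True if a HEAD request to `url` succeeds with a 2xx status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, TimeoutError, ValueError):
        # Broken link, timeout, or malformed URL: treat as not reachable.
        return False


def plan_uploads(records, is_reachable=url_is_reachable):
    """Split records into those to re-version and those to leave untouched."""
    to_upload, untouched = [], []
    for record in records:
        if is_reachable(record["url"]):
            to_upload.append(record)   # would get a new version with the real file
        else:
            untouched.append(record)   # URL already broken: change nothing
    return to_upload, untouched
```

Passing the reachability check in as a parameter keeps the planning step testable without network access. Some servers reject HEAD requests, so a fallback to GET may be needed in practice.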
Hi Eli, Jen & I just did a deep-dive on Zenodo and came up with a list of things we would like to change. Here are the things we hope you can do through the API:
Let us know if you have any questions. |
Hi Jen, @jhammock I think I set these records initially to 'Restricted'. |
@KatjaSchulz @jhammock Another API setback is that it cannot assign identifiers (e.g. ORCID) to Creators and Contributors. Anyway, the rest of the requirements were met fine. Also I removed all Contributors with my name 'Eli Agbayani'. These are just remnants of the old CKAN framework. |
@jhammock Arctic Biodiversity: Arctic Freshwater Fishes Africa Tree Database Fairbairn, 2013 Ramirez, et al, 2008: Ramirez et al, 2008 Only the 2nd record in each of these pairs was processed. |
@KatjaSchulz
Or only add 'geography' strictly for values: without " 2019"
Thanks. |
Good question! We mulled that over, but based on the zenodo search tools decided not. We're not confident of being able to filter conveniently to exclude deprecated datasets, so we don't want to give those any other tags. |
I've started to mess around with tags and metadata and wanted to check something before I make a mess. Eventually, we'll need a mapping of old CKAN addresses to their corresponding zenodo addresses in order to update the resource file links in the harvesting layer. I wouldn't say automating this is super important, but if we have such a mapping already or could easily make one it will certainly be useful, and I want to make sure I'm not messing that up. I've started editing the Related Works metadata, adding two things so far:
But more urgently, @eliagbayani, I've deleted a few "is supplement to" relationships (like this one, not yet removed), thinking we only needed them in case of the file upload difficulties we had earlier. However, if those relationships are present on all our zenodo records and are the easiest way to trace them back to the CKAN records, perhaps I should hold off. Please let me know what you think about that CKAN<->zenodo mapping, and in particular whether I should leave the supplement relationships alone for that or any other reason. I do want to remove them eventually to avoid confusing our zenodo visitors, but there's no great rush. |
@jhammock , |
Thanks. |
Thanks for that quick investigation, Eli! Yes, that mapping looks like it will make the updating of our harvest layer links very easy when the time comes. So the important thing is for me not to bother the is-supplement relationships for the live connector resources. Where's the best place for me to refer to for a list of those? In the Jenkins? If you can handily automate the is-source relationships, that would be grand; if not, no complaints. Let me know- if it is, I'll remove the ones I've entered manually, so you can make a clean job of the whole collection. That'll only need to be done once, and I'll probably end up removing a few afterwards. Not everything with a resource page in the publishing layer is published, approved, and non-redundant :) |
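For reference, one way such a CKAN-to-Zenodo mapping might be derived from the "is supplement to" relationships is sketched below. It assumes each record's metadata has already been fetched into a dict with a `related_identifiers` list (the field name used by Zenodo's deposit metadata) plus a hypothetical `zenodo_url` key, and that old CKAN addresses can be recognised by host name; the default host shown is an assumption.

```python
def ckan_to_zenodo(records, ckan_host="opendata.eol.org"):
    """Map old CKAN URLs to Zenodo record URLs via isSupplementTo relations.

    `ckan_host` is an assumed host name for the old CKAN site.
    """
    mapping = {}
    for record in records:
        for rel in record.get("related_identifiers", []):
            identifier = rel.get("identifier", "")
            if rel.get("relation") == "isSupplementTo" and ckan_host in identifier:
                mapping[identifier] = record["zenodo_url"]
    return mapping
```

Records whose supplement relationship has already been deleted by hand will be missing from the result, which is the reason to leave those relationships in place until the harvest layer links are updated.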
|
@jhammock, |
Splendid! I'll leave that to you, then. Thanks :) |
Finally finished adding Related Works -> 'is source of' relationships in Zenodo for all published EOL resources. |
I've just finished my first round of manual interventions. Hopefully I haven't made too much of a mess and I have learned a few things. To address some of the mess I have made, I hope this is possible: for all records bearing tag=deprecated and also a Related Works -> 'is source of' relationship, remove the deprecated tag. That will un-deprecate several hundred GBIF 2019 checklists which I mistakenly thought were the older version. To balance this out, so to speak, it would be great if all the non-2019 GBIF checklists (eg: https://zenodo.org/records/13313155) could be identified and given a "deprecated" tag. These have a few attributes in common that may be helpful, apart from the titles containing the string "Checklists:" w/o "2019". I think they all have Creator=Anne Thessen (the 2019 versions do not) and the sample I checked have create date = 2018-01-something (water bodies) or 2017-11-something (national) |
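The two clean-up rules requested above can be expressed as small filters over record metadata. The sketch below assumes metadata dicts with `title`, `keywords` and `related_identifiers` fields, as in Zenodo's deposit metadata; the function names are hypothetical, and the additional checks mentioned (Creator=Anne Thessen, creation dates) are not included.

```python
def has_relation(metadata, relation):
    """True if any related work on the record uses the given relation."""
    return any(r.get("relation") == relation
               for r in metadata.get("related_identifiers", []))


def undeprecate(metadata):
    """Drop the 'deprecated' tag from records that are the source of an EOL
    resource. Returns updated metadata, or None if no change is needed."""
    keywords = metadata.get("keywords", [])
    if "deprecated" in keywords and has_relation(metadata, "isSourceOf"):
        return {**metadata, "keywords": [k for k in keywords if k != "deprecated"]}
    return None


def deprecate_old_checklist(metadata):
    """Tag non-2019 GBIF checklists as deprecated, matching on the title only.
    Returns updated metadata, or None if no change is needed."""
    title = metadata.get("title", "")
    keywords = metadata.get("keywords", [])
    is_old = "Checklists:" in title and "2019" not in title
    if is_old and "deprecated" not in keywords:
        return {**metadata, "keywords": keywords + ["deprecated"]}
    return None
```

Both functions return a new dict rather than editing in place, so the planned changes can be reviewed before anything is written back to Zenodo.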
We're making some progress on the attribution fields. @eliagbayani , I'd like you to be indicated as a Contributor on any file produced by a connector. Use whatever identifier or format you like. Offhand I'd suggest role=Data Manager, but if you see another value that makes more sense to a professional eye, you know best! I'm hoping the Jenkins and/or your connector repository will help you to inventory all the relevant records. |
@JRice when you have a moment, could you list all our bespoke data export products to which you hold the keys? I want to put your name on them in zenodo. Offhand I'm thinking you may be implicated in but some of those may be in Eli's court and there may be others that I've missed. |
@eliagbayani A curiosity: https://zenodo.org/records/13381012 should, I think, have a related work, "is source of https://eol.org/resources/459". I'm guessing something about our visibility settings either in CKAN or in zenodo prevented that relationship from being populated- but I've been wrong before in identifying our zenodo records. Have I got the right one? If so I'll fill in the relationship manually. |
Yes, I assume it being 'restricted' excluded it during the bulk update. Yes we can manually add the isSourceOf relationship URL https://eol.org/resources/459 . Thanks. |
@JRice <https://github.com/JRice> when you have a moment, could you list
all our bespoke data export products to which you hold the keys? I want to
put your name on them in zenodo. Offhand I'm thinking you may be implicated
in
-the all-traits export
-the taxon ID map
-the media manifest
-the translatewiki export:
https://eol.org/data/term_name_translations.json
but some of those may be in Eli's court and there may be others that I've
missed.
Only these:
The identifier map (provider_ids.tgz)
The full identifier map (full_provider_ids.tgz)
The media manifest (*_manifest.tgz)
The sitemap (but this doesn't end up in Zenodo; just adding for
completeness)
|
@jhammock , question please. |
OK, here's an ideal scenario which may not be practical. The 2019 checklists are derivative works based on the non-2019 checklists. Strictly speaking, each 2019 file should have a derived_from relationship to the non-2019 file for the same country or marine region. I think the titles are all strictly matched (with "2019" appended). Does that make it fairly easy? If so, I think that suffices to document Anne's role. If it's not practical, then yes, we should attach her directly to the 2019 checklists in the contributor field with role=data manager. Thanks! |
Oh adding the derived_from relationship to the 2019 files is indeed a better solution. I will do that instead. |
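The title matching described above could look something like this sketch. It assumes each record is a dict with hypothetical `id` and `title` keys, that "2019" is appended at the very end of the title, and that the relation is expressed as a Zenodo `related_identifiers` entry pointing at the older record's URL.

```python
def derived_from_pairs(records):
    """Pair each 2019 record with the non-2019 record of the same title.

    Returns a list of (id_2019, id_original) tuples; 2019 records with no
    exact title match are skipped.
    """
    by_title = {r["title"].strip(): r for r in records}
    pairs = []
    for title, record in by_title.items():
        if title.endswith("2019"):
            base_title = title[: -len("2019")].strip()
            base = by_title.get(base_title)
            if base is not None:
                pairs.append((record["id"], base["id"]))
    return pairs


def derived_from_relation(base_record_url):
    """Related-work entry to add to the 2019 record."""
    return {"relation": "isDerivedFrom", "identifier": base_record_url, "scheme": "url"}
```

Any 2019 checklist left unpaired is worth listing separately, since it would indicate a title that does not match strictly.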
Updates: All these records are now tagged as 'deprecated': And soon all the 2019 records will have isDerivedFrom relation, bulk update for this is currently running. |
Updates:
Thanks. |
For your next trick, @eliagbayani, we are hoping you can plunder the description field throughout our zenodo collection, and identify doi strings where they are mentioned. There may be one (as in Jen's dozens of tiny one-source resource files) or multiple dois (as in Katja's grander trait compilations) in the description field. We'd love to have these nice standard identifiers in a structured form and field of their own. For each doi found, could you please generate a related works record, with relation=References, scheme=doi, and resource type=publication? I think you can put the doi string directly into the identifier field in any of the formats we seem to have used, without further modification. The formats are pretty variable, which might make finding the dois an interesting adventure. I've added a (random?) sample below. If you don't find them all, that's not critical; we'll carry on adding them manually for new resources anyway; we're just a bit daunted at the idea of catching up with all the ones that are already mentioned. https://doi.org/10.5061/dryad.37pvmcvsj |
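A rough sketch of the DOI harvesting requested above: pull DOI-like strings out of a description with a regular expression, de-duplicate them, and turn each into a related-works entry. The pattern is a common heuristic rather than a full DOI grammar, the function names are hypothetical, and unlike the request it normalises every hit to the bare `10.xxxx/...` form.

```python
import re

# Heuristic: "10.", a 4-9 digit registrant code, "/", then a suffix that
# runs until whitespace, a quote, or an angle bracket.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>]+')


def extract_dois(description):
    """Return unique DOIs found in the text, in order of first appearance."""
    seen, dois = set(), []
    for match in DOI_PATTERN.findall(description):
        doi = match.rstrip(".,;:)]}'")  # trim trailing sentence punctuation
        if doi.lower() not in seen:     # DOIs are case-insensitive
            seen.add(doi.lower())
            dois.append(doi)
    return dois


def references_for(description):
    """Build one related-works entry per DOI found in the description."""
    return [
        {"relation": "references", "identifier": doi,
         "scheme": "doi", "resource_type": "publication"}
        for doi in extract_dois(description)
    ]
```

Trimming trailing punctuation can clip the rare DOI that legitimately ends in a bracket, so a manual spot check of the results is still advisable.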
@jhammock |
I don't know of a reason to prefer either one- @KatjaSchulz have you a preference? |
This draft looks great, @eliagbayani ! I did find one case that didn't receive a relationship. Any ideas? It does have the same doi mentioned twice in exactly the same format, if that could be an issue. (I'll edit the description later, but wanted you to see as is) |
Yes, it is weird that this record was excluded in my subset of records with string 'doi' found in description. It is not because of the double entry. Anyway I ran it by itself and it now has DOI in Related Works. Thanks Jen. |
Steps: