Migrate OpenData datasets to Zenodo #16

Open
eliagbayani opened this issue Jul 30, 2024 · 66 comments
@eliagbayani
Collaborator

eliagbayani commented Jul 30, 2024

Steps:

  • retrieve opendata.eol.org datasets using CKAN API
  • assign which fields from CKAN API will correspond to fields in the Zenodo API.
  • create Zenodo datasets using Zenodo API
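
A minimal sketch of the second step, assuming CKAN's usual package fields (`title`, `notes`, `tags`, `organization`) and the metadata keys of Zenodo's deposit API (`upload_type`, `title`, `description`, `creators`, `keywords`). The pairing of fields is an illustrative assumption, not the mapping actually used in this migration, and the sample package content is made up.

```python
from typing import Any, Dict


def ckan_to_zenodo_metadata(package: Dict[str, Any]) -> Dict[str, Any]:
    """Map one CKAN package dict to a Zenodo deposit 'metadata' dict.

    The field pairing is a hypothetical example, not the migration's own.
    """
    org = (package.get("organization") or {}).get("title", "")
    keywords = [tag["name"] for tag in package.get("tags", [])]
    if org:
        # Zenodo has no organization level, so carry it along as a keyword.
        keywords.append(org)
    return {
        "upload_type": "dataset",
        "title": package["title"],
        # Fall back to the title when the CKAN notes field is empty.
        "description": package.get("notes") or package["title"],
        "creators": [{"name": package.get("author") or "unknown"}],
        "keywords": keywords,
    }


# Made-up sample package, for illustration only.
example_package = {
    "title": "Africa Tree Database",
    "notes": "Sample description text.",
    "organization": {"title": "EOL Content Partners"},
    "tags": [{"name": "trees"}],
}
example_metadata = ckan_to_zenodo_metadata(example_package)
print(example_metadata["keywords"])
```

Keeping the mapping in one pure function makes it easy to review the field choices before any record is created.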
@eliagbayani eliagbayani self-assigned this Jul 30, 2024
@eliagbayani
Collaborator Author

Hi Jen @jhammock,
Attached is a list of private datasets from our five organizations in opendata.eol.org.
I will exclude these datasets from the migration to Zenodo, unless you want me to include some from the list.
Thanks.
private_datasets.txt

@jhammock
Collaborator

jhammock commented Aug 2, 2024

Ah yes, we'll need to decide what to do about those. I expect the files used in resource connectors should go into the new docker container. I'll review the "old resources"; possibly those can go into Zenodo as well, but I'll check them individually.

@eliagbayani
Collaborator Author

eliagbayani commented Aug 4, 2024

@jhammock @KatjaSchulz
All broken URLs in opendata.eol.org are now once more accessible.
That is, those URLs written in this long format (previously broken) are now accessible:

In the actual OpenData resource record, the URL is now transformed in this format (shorter):

Nonetheless, both URL formats are accessible.
So we won't get this type of alert anymore.

I needed this done before I migrate anything to Zenodo.
Admittedly, fixing the broken long URLs was an accident when I made the shorter URLs work :-)
Thanks.

@jhammock
Collaborator

jhammock commented Aug 4, 2024

Thanks for the update, Eli! I'll appreciate not having that to worry about until we're migrated :)

@KatjaSchulz

KatjaSchulz commented Aug 5, 2024 via email

@eliagbayani
Collaborator Author

eliagbayani commented Aug 7, 2024

Hi Jen, @jhammock

  • After running a bulk migration, plus a number of manual migrations for records that initially failed, all public datasets/resources from two organizations are now in Zenodo under the EOL Community:
    1. Aggregate Datasets
    2. EOL Dynamic Hierarchy Data Sets
  • For review.
  • I used the keywords (Subjects) to provide navigation to our resources. Zenodo has no equivalent of our organization -> dataset -> resource levels.

Thanks.

@eliagbayani
Collaborator Author

eliagbayani commented Aug 12, 2024

@jhammock @KatjaSchulz
Tip: If you know the complete title of your record in Zenodo and want to search for it, paste this in the search box:
title:("Your Complete Title")

You can also search by Subject:
subject("EOL Content Partners: Water Body Checklists")

More search tips here.
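
The same title search can be sketched as a URL against Zenodo's public records API (`https://zenodo.org/api/records` with a `q` parameter). The helper names are hypothetical, and the sketch only builds the URL; it does not send a request.

```python
from urllib.parse import urlencode

ZENODO_RECORDS_API = "https://zenodo.org/api/records"


def title_query(title: str) -> str:
    """Wrap a complete title in the title:("...") search form."""
    escaped = title.replace('"', '\\"')  # keep embedded quotes from ending the phrase
    return f'title:("{escaped}")'


def search_url(query: str) -> str:
    """Build a records-search URL with the query URL-encoded."""
    params = {"q": query}
    return ZENODO_RECORDS_API + "?" + urlencode(params)


url = search_url(title_query("EOL full taxon identifier map"))
print(url)
```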

@eliagbayani
Collaborator Author

eliagbayani commented Aug 19, 2024

@jhammock @KatjaSchulz
Update:
Generated an HTML page that will initially assist us in navigating the individual specific (public) resources in Zenodo.
This HTML page was organized using our OpenData's original sections: organizations -> datasets -> resources.
Zenodo doesn't have this type of section.
opendata_zenodo.html.zip
Please unzip to get the HTML page. Thanks.

@eliagbayani
Collaborator Author

@jhammock @KatjaSchulz
All public datasets are now in Zenodo.
I have not yet moved the private datasets from opendata.eol.org to Zenodo.
Do we need to do that?
If we do move them, they will take the 'restricted' option in Zenodo.
Restricted means the record itself is publicly visible, but its files are available only to users who have been given access.
Thanks.

@jhammock
Collaborator

I think that status ought to suit most if not all such cases. @KatjaSchulz , we should both check, I suppose. If there's something we don't want to even announce that we have, we can move it offline for now.

@eliagbayani
Collaborator Author

eliagbayani commented Aug 27, 2024

1st private record (restricted) e.g. WoRMS internal: World Register of Marine Species
'Restricted' status works as intended. If you're not logged in, you will not be able to download the file.
Will continue with the others.

@eliagbayani
Collaborator Author

Status:
From: Aug 28

@jhammock
Collaborator

jhammock commented Sep 3, 2024

No concerns about the test dataset. It may not be the one currently in use, and we can always make up another.

@jhammock
Collaborator

@eliagbayani I'm trying to orient myself to the zenodo interface. Can you explain this to me?

https://zenodo.org/records/13253933/files/13253933.dat?download=1

It's listed under "Files" at https://zenodo.org/records/13253933

@eliagbayani
Collaborator Author

@jhammock
The .dat file is a temporary placeholder I used when the main file was not available during the migration. In this case the main file is:
https://eol.org/data/full_provider_ids.csv.gz
I assume this file was inaccessible at the time of migration, even after a number of tries, so the script fell back to the .dat file in order for the record to be published.
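
A minimal sketch of that fallback, not the migration script itself. The function name, the retry count, and the placeholder's content are assumptions; only the `<record id>.dat` naming follows the file seen on record 13253933. The download function is passed in, so the logic can be exercised without network access.

```python
from typing import Callable, Optional, Tuple


def choose_upload(
    url: str,
    fetch: Callable[[str], Optional[bytes]],
    record_id: str,
    max_tries: int = 3,
) -> Tuple[str, bytes]:
    """Return (filename, content) to upload for one record.

    Tries the real file up to max_tries times; if every attempt fails,
    falls back to a small placeholder named <record_id>.dat.
    """
    for _ in range(max_tries):
        content = fetch(url)
        if content is not None:
            filename = url.rsplit("/", 1)[-1]
            return filename, content
    # Hypothetical placeholder content; the real .dat content is not known here.
    placeholder = f"File is hosted elsewhere: {url}\n".encode()
    return f"{record_id}.dat", placeholder


MAIN_FILE = "https://eol.org/data/full_provider_ids.csv.gz"
unreachable = choose_upload(MAIN_FILE, lambda u: None, "13253933")
reachable = choose_upload(MAIN_FILE, lambda u: b"data", "13253933")
print(unreachable[0], reachable[0])
```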

@jhammock
Collaborator

So the plan is for the intended files to replace the temp file ultimately, wherever it appears? Is manual editing needed?

@eliagbayani
Collaborator Author

eliagbayani commented Sep 10, 2024

Yes, this one needs manual editing.
step 1: click [New Version]
step 2: upload the desired file, click button [Upload files]
step 3: enter the Publication date
step 4: finally click on [Publish] button.
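
For comparison, the four manual steps could be sketched as the corresponding requests in Zenodo's deposit REST API (`actions/newversion`, a file upload to the draft's bucket, a metadata update, `actions/publish`). This is an assumed API equivalent of the web-interface steps, not something run in this thread. The function only lists the requests; the bucket URL below is a hypothetical placeholder, and in practice the draft id and bucket URL would be read from the response to the first request.

```python
from typing import Any, Dict, List

DEPOSIT_API = "https://zenodo.org/api/deposit/depositions"


def new_version_requests(
    record_id: int,
    draft_id: int,
    bucket_url: str,
    filename: str,
    publication_date: str,
) -> List[Dict[str, Any]]:
    """List the HTTP requests that mirror the four manual steps, in order."""
    return [
        # step 1: [New Version]
        {"method": "POST", "url": f"{DEPOSIT_API}/{record_id}/actions/newversion"},
        # step 2: upload the desired file into the draft's file bucket
        {"method": "PUT", "url": f"{bucket_url}/{filename}"},
        # step 3: set the publication date on the draft
        # (a real update would send the record's full metadata, not just this key)
        {
            "method": "PUT",
            "url": f"{DEPOSIT_API}/{draft_id}",
            "json": {"metadata": {"publication_date": publication_date}},
        },
        # step 4: [Publish]
        {"method": "POST", "url": f"{DEPOSIT_API}/{draft_id}/actions/publish"},
    ]


plan = new_version_requests(
    record_id=13253933,
    draft_id=13741713,
    bucket_url="https://zenodo.org/api/files/EXAMPLE-BUCKET",  # hypothetical
    filename="full_provider_ids.csv.gz",
    publication_date="2024-09-10",
)
for request in plan:
    print(request["method"], request["url"])
```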

@eliagbayani
Collaborator Author

@jhammock, here is the New Version that you initiated but did not complete.
https://zenodo.org/uploads/13741713
Just in case you are looking for it.

@jhammock
Collaborator

I can't remember starting that process so I discarded it. Just checking:

  • The plan is to have zenodo host the files?
  • My uploading it now will not interfere with your ability to update it automatically later? I believe this file is updated on a regular schedule. I think you're dealing with connector-based resources first, but in due course I presume the aggregated data files will be equipped for updates also.

@eliagbayani
Collaborator Author

eliagbayani commented Sep 10, 2024

@jhammock
case 1 - Yes, eventually Zenodo can host the files.
And yes, your uploading it now will not interfere with my ability to update it automatically later.
If case 1 is met, we don't need a .dat file anymore.

case 2 - Or we provide just the URL e.g. https://eol.org/data/full_provider_ids.csv.gz as metadata in Zenodo record.
If case 2 is met, we need to have a .dat file or any file (I chose .dat) uploaded to publish the Zenodo record.

@jhammock
Collaborator

OK, I can see advantages to both cases, but if zenodo policy permits, I think I crave the redundancy of them hosting a copy of all files we list there. We'd presumably also have one of everything, eventually in your new docker instance, @eliagbayani . @KatjaSchulz do you concur?

@eliagbayani
Collaborator Author

Yes I vote for redundancy as well. Thanks.

@jhammock
Collaborator

Okay, I am getting familiar with zenodo metadata edits. I gather a new version of a resource is only required when the files associated with the record are changed. I have created v2 of the identifier map. I have also messed with some of the metadata, in several subsequent edits, and learned that this can be done while preserving the same version-specific doi. Yay!

@KatjaSchulz you should definitely review this one because I named you as the creator. You may prefer to name an institution, which is an option, or to name several creators. I am implicated also for the moment, in the contributor category, as a "contact person". We should probably hash out a policy about this kind of metadata in the zenodo context; the aggregate datasets will probably be case by case, but for the resource files we should be able to do something consistent- or a few different consistent things over different kinds of resources.

@KatjaSchulz

KatjaSchulz commented Sep 17, 2024 via email

@eliagbayani
Collaborator Author

eliagbayani commented Sep 18, 2024

@jhammock @KatjaSchulz
Attached is a list of records where files are saved elsewhere (n=56).
If I'm not mistaken, all should have a .dat file as their uploaded file.
Except for one:
[title] => identifier map: current version
[URL] => https://eol.org/data/full_provider_ids.csv.gz
[Zenodo] => https://zenodo.org/records/13253933

Where its latest version is now:
EOL full taxon identifier map
https://zenodo.org/records/13751009

Jen,
Question: do you want me to create and run a script that checks whether each URL is still valid and, if so, uploads the actual file to its respective Zenodo record?
That will create a new version of the record (Version 2) holding the uploaded file.
If the URL is already broken, I won't change anything.

Or do you want these records handled manually by you and Katja?
Thanks.
FilesSavedElsewhere.txt

@jhammock
Collaborator

jhammock commented Sep 18, 2024 via email

@KatjaSchulz

Hi Eli,

Jen & I just did a deep-dive on Zenodo and came up with a list of things we would like to change. Here are the things we hope you can do through the API:

  1. Agents
    1. For records that have Hosting institution: Anne Thessen under Contributors, remove the Contributors record, remove the "script (Zenodo API)" Creator and add the following as the new Creator:
      • Person
      • Name: Anne Thessen [important: do not link to any identifiers]
      • Affiliations: Encyclopedia of Life
      • Role: Data Manager
    2. For all other records that have "script (Zenodo API)" as the Creator, remove this Creator and add the following as the new Creator:
      • Organization
      • Name: Encyclopedia of Life
      • Role: Hosting Institution
    3. Remove all remaining Contributors with Role: Hosting Institution.
  2. Keywords & subjects
    1. For all data sets with keyword "EOL Content Partners: National Checklists 2019" or "EOL Content Partners: Water Body Checklists 2019" add keyword "deprecated"
    2. Remove all keywords with the prefix "format:", e.g., "format: ZIP", "format: TAR", "format: XML", etc.
  3. Notes: It looks like the Notes field in Zenodo currently contains a combination of the OpenData resource and organization description. We would like to handle this in a different way:
    1. Please move the content that's currently in the Zenodo Notes field to the Description field instead. If there is already content in the Description field, append the content from the Notes field.
    2. Please entirely remove this text from all Notes, i.e., do not include it in the text appended to the Description: "This is where EOL hosts source datasets (archives, dumps, etc.) from EOL content partners (especially partners without a web presence of their own). This organization will also include the content partner utility files EOL connectors use to generate a particular content partner__s resource EOL archive or XML. For questions or suggestions please visit the EOL Services forum at http://discuss.eol.org/c/eol-services ####--- __EOL DwCA resource last updated: .... ---####"
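
Items 2 and 3 of this list can be sketched as one pure function over a record's metadata. This is an illustration only, assuming the deposit-API keys `keywords`, `notes`, and `description`; the function name is hypothetical, the sample record is made up, and the Agents changes in item 1 are not covered.

```python
import re
from typing import Any, Dict

DEPRECATED_TRIGGERS = {
    "EOL Content Partners: National Checklists 2019",
    "EOL Content Partners: Water Body Checklists 2019",
}
# Matches the organization boilerplate, from its first words through the
# closing "---####" marker, whatever "last updated" text sits in between.
BOILERPLATE = re.compile(
    r"This is where EOL hosts source datasets.*?---####", re.DOTALL
)


def clean_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    cleaned = dict(metadata)
    # 2.2: drop every "format:" keyword
    keywords = [k for k in metadata.get("keywords", []) if not k.startswith("format:")]
    # 2.1: tag the 2019 checklist records as deprecated
    if DEPRECATED_TRIGGERS & set(keywords) and "deprecated" not in keywords:
        keywords.append("deprecated")
    cleaned["keywords"] = keywords
    # 3.2: strip the boilerplate, then 3.1: append what is left of the notes
    notes = BOILERPLATE.sub("", metadata.get("notes", "")).strip()
    description = metadata.get("description", "").strip()
    cleaned["description"] = "\n\n".join(p for p in (description, notes) if p)
    cleaned["notes"] = ""
    return cleaned


# Made-up sample record, for illustration only.
sample = {
    "keywords": ["format: ZIP", "EOL Content Partners: National Checklists 2019"],
    "description": "Checklist.",
    "notes": "Organization info. This is where EOL hosts source datasets "
             "(archives, dumps, etc.) ####--- __EOL DwCA resource last updated: Aug 1 ---####",
}
result = clean_metadata(sample)
print(result["keywords"])
print(result["description"])
```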

Let us know if you have any questions.

@eliagbayani
Collaborator Author

Hi Jen, @jhammock
These are the 7 records under the EOL computer vision pipelines

I think I set these records initially to 'Restricted'.
I'm not sure if my recent bulk updates have accidentally set these to 'Public'.
Or have you set these to 'Public'?
If not I'll just set them back to 'Restricted'.
Thanks.

@eliagbayani
Collaborator Author

eliagbayani commented Sep 30, 2024

@KatjaSchulz @jhammock
The script finished doing the bulk updates. Zenodo
As I mentioned before, it seems Zenodo's 'write API' is lagging behind what the interface can do.
One is that the API cannot set a Creator to be of type 'Organization'. It always defaults to 'Personal'.
Also the API cannot set the 'role' of the Creator.
But it CAN set the 'role' of the Contributor.

Another API setback is that it cannot assign identifiers (e.g. ORCID) to Creators and Contributors.

Anyway, the rest of the requirements were met fine.

Also I removed all Contributors with my name 'Eli Agbayani'. These are just remnants of the old CKAN framework.
But I set others like 'Jen Hammock' or 'Sarah Miller' as 'Contact Person'. Please tell me if we need to change this.
And as proposed 'Anne Thessen' as 'Data Manager'.
Thanks.

@eliagbayani
Collaborator Author

eliagbayani commented Oct 5, 2024

@jhammock
Good catch Jen. Thanks.
Found the culprit: same titles, different records. These records were saved the same way in CKAN.
The bulk-update script assumed that titles are unique and so missed 281 records, e.g.:

Arctic Biodiversity: Arctic Freshwater Fishes
https://zenodo.org/records/13315783
https://zenodo.org/records/13315751

Africa Tree Database
https://zenodo.org/records/13312623
https://zenodo.org/records/13312619

Fairbairn, 2013
https://zenodo.org/records/13316319
https://zenodo.org/records/13316311

Ramirez, et al, 2008: Ramirez et al, 2008
https://zenodo.org/records/13310465
https://zenodo.org/records/13310461

Only the 2nd record in each of these pairs was processed.
Anyway, all 281 records missed the last time are now processed as well.
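
A minimal sketch of how a title-keyed lookup loses records, and how grouping ids per title exposes the collisions. The dict-overwrite mechanism is an assumption about the kind of bug involved, not a description of the actual script; the sample ids are the ones listed above.

```python
from collections import defaultdict
from typing import Dict, List

records = [
    {"id": 13312623, "title": "Africa Tree Database"},
    {"id": 13312619, "title": "Africa Tree Database"},
    {"id": 13316319, "title": "Fairbairn, 2013"},
]

# Keying by title silently keeps one record per title: the later entry wins.
naive_lookup = {r["title"]: r["id"] for r in records}


def duplicated_titles(items: List[dict]) -> Dict[str, List[int]]:
    """Group record ids by title and keep only titles used more than once."""
    by_title = defaultdict(list)
    for item in items:
        by_title[item["title"]].append(item["id"])
    return {title: ids for title, ids in by_title.items() if len(ids) > 1}


duplicates = duplicated_titles(records)
print(len(naive_lookup), "titles for", len(records), "records")
print(duplicates)
```

Iterating over record ids rather than titles avoids the problem entirely, since ids are unique by construction.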

@eliagbayani
Collaborator Author

eliagbayani commented Oct 6, 2024

@KatjaSchulz
Should I also add the tag 'geography' if the existing tags are:

  • EOL Content Partners: National Checklists 2019
  • EOL Content Partners: Water Body Checklists 2019

Or should I add 'geography' only for the values without " 2019":

  • EOL Content Partners: National Checklists
  • EOL Content Partners: Water Body Checklists

Thanks.

@jhammock
Collaborator

jhammock commented Oct 6, 2024

Good question! We mulled that over, but based on the zenodo search tools decided not. We're not confident of being able to filter conveniently to exclude deprecated datasets, so we don't want to give those any other tags.

@eliagbayani
Collaborator Author

After a couple of adjustments, all proposed tag clean-ups here are now implemented. Zenodo.
Thanks.

@jhammock
Collaborator

I've started to mess around with tags and metadata and wanted to check something before I make a mess. Eventually, we'll need a mapping of old CKAN addresses to their corresponding zenodo addresses in order to update the resource file links in the harvesting layer. I wouldn't say automating this is super important, but if we have such a mapping already or could easily make one it will certainly be useful, and I want to make sure I'm not messing that up. I've started editing the Related Works metadata, adding two things so far:

  • is derived from [the publication or content partner database or whatever, outside EOL]
  • is source of [link to EOL resource page in the publishing layer]
    example
    comments welcome on those choices!

But more urgently, @eliagbayani , I've deleted a few "is supplement to" relationships (like this one, not yet removed), thinking we only needed them in case of the file upload difficulties we had earlier. However, if those relationships are present on all our zenodo records, and are the easiest way to trace them back to the ckan records, perhaps I should hold off. Please let me know what you think about that ckan<->zenodo mapping, and in particular whether I should leave the supplement relationships alone for that or any other reason. I do want to remove them eventually to avoid confusing our zenodo visitors, but there's no great rush.

@eliagbayani
Collaborator Author

@jhammock ,
I'm exploring and will get back to your message. Thanks.

@eliagbayani
Collaborator Author

eliagbayani commented Oct 22, 2024

@jhammock

  • your introduction of the relationship "is source of" is a welcome addition. It shows a clear link back to eol.org.
    I can also check if I can do a bulk-update to add the "is source of" relationship.

  • regarding the ckan<->zenodo mapping. I think I already have something like it. Please check this PDF.
    EOL_resource_id_and_Zenodo_id_file.pdf

  • the "is supplement to" relationship is relevant to those records where the file (DwCA) is something that we generate and have a connector for. e.g. FishBase
    I recommend we leave it for these records as I use it to link to our connectors.
    That is, to facilitate auto-update of respective Zenodo record after connector finishes.
    But we can remove it for those we don't have a connector for e.g. Reid et al, 2012

Thanks.

@jhammock
Collaborator

Thanks for that quick investigation, Eli! Yes, that mapping looks like it will make the updating of our harvest layer links very easy when the time comes. So the important thing is for me not to bother the is-supplement relationships for the live connector resources. Where's the best place for me to refer to for a list of those? In the Jenkins?

If you can handily automate the is-source relationships, that would be grand; if not, no complaints. Let me know- if it is, I'll remove the ones I've entered manually, so you can make a clean job of the whole collection. That'll only need to be done once, and I'll probably end up removing a few afterwards. Not everything with a resource page in the publishing layer is published, approved, and non-redundant :)

@eliagbayani
Collaborator Author

@jhammock ,

@eliagbayani
Collaborator Author

@jhammock,
Confirmed, we can add Related Works -> 'is source of' in bulk-updates.
Thanks.

@jhammock
Collaborator

Splendid! I'll leave that to you, then. Thanks :)

@eliagbayani
Collaborator Author

> Splendid! I'll leave that to you, then. Thanks :)

Finally finished adding Related Works -> 'is source of' relationships in Zenodo for all published EOL resources.
Zenodo. Thanks.

@jhammock
Collaborator

I've just finished my first round of manual interventions. Hopefully I haven't made too much of a mess and I have learned a few things. To address some of the mess I have made, I hope this is possible:

for all records bearing tag=deprecated and also a Related Works -> 'is source of' relationship, remove the deprecated tag.

That will un-deprecate several hundred GBIF 2019 checklists which I mistakenly thought were the older version. To balance this out, so to speak, it would be great if all the non-2019 GBIF checklists (eg: https://zenodo.org/records/13313155) could be identified and given a "deprecated" tag. These have a few attributes in common that may be helpful, apart from the titles containing the string "Checklists:" w/o "2019". I think they all have Creator=Anne Thessen (the 2019 versions do not) and the sample I checked have create date = 2018-01-something (water bodies) or 2017-11-something (national)
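
The two requests above can be sketched as rules over a record's metadata, assuming the deposit-API layout where keywords are a list of strings and Related Works appear under `related_identifiers` with a `relation` value such as `isSourceOf`. The function names, sample titles, and sample URL are hypothetical.

```python
from typing import Any, Dict


def has_relation(metadata: Dict[str, Any], relation: str) -> bool:
    related = metadata.get("related_identifiers", [])
    return any(item.get("relation") == relation for item in related)


def undeprecate_if_source(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Drop the 'deprecated' tag from records that are the source of an EOL resource."""
    keywords = list(metadata.get("keywords", []))
    if "deprecated" in keywords and has_relation(metadata, "isSourceOf"):
        keywords = [k for k in keywords if k != "deprecated"]
    return {**metadata, "keywords": keywords}


def deprecate_if_old_checklist(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Tag checklist records whose title has 'Checklists:' but not '2019'."""
    keywords = list(metadata.get("keywords", []))
    title = metadata.get("title", "")
    if "Checklists:" in title and "2019" not in title and "deprecated" not in keywords:
        keywords.append("deprecated")
    return {**metadata, "keywords": keywords}


# Hypothetical sample records.
current = undeprecate_if_source({
    "title": "National Checklists 2019: Albania",
    "keywords": ["deprecated", "geography"],
    "related_identifiers": [
        {"relation": "isSourceOf", "identifier": "https://eol.org/resources/EXAMPLE"}
    ],
})
older = deprecate_if_old_checklist({
    "title": "National Checklists: Albania",
    "keywords": [],
})
print(current["keywords"], older["keywords"])
```

Creator and creation date, mentioned above as further hints, could be added as extra conditions if the title test alone proves too broad.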

@jhammock
Collaborator

We're making some progress on the attribution fields. @eliagbayani , I'd like you to be indicated as a Contributor on any file produced by a connector. Use whatever identifier or format you like. Offhand I'd suggest role=Data Manager, but if you see another value that makes more sense to a professional eye, you know best! I'm hoping the Jenkins and/or your connector repository will help you to inventory all the relevant records.

@jhammock
Collaborator

@JRice when you have a moment, could you list all our bespoke data export products to which you hold the keys? I want to put your name on them in zenodo. Offhand I'm thinking you may be implicated in
-the all-traits export
-the taxon ID map
-the media manifest
-the translatewiki export: https://eol.org/data/term_name_translations.json

but some of those may be in Eli's court and there may be others that I've missed.

@jhammock
Collaborator

@eliagbayani A curiosity: https://zenodo.org/records/13381012 should, I think, have a related work, "is source of https://eol.org/resources/459". I'm guessing something about our visibility settings either in CKAN or in zenodo prevented that relationship from being populated- but I've been wrong before in identifying our zenodo records. Have I got the right one? If so I'll fill in the relationship manually.

@eliagbayani
Collaborator Author

> @eliagbayani A curiosity: https://zenodo.org/records/13381012 should, I think, have a related work, "is source of https://eol.org/resources/459". I'm guessing something about our visibility settings either in CKAN or in zenodo prevented that relationship from being populated- but I've been wrong before in identifying our zenodo records. Have I got the right one? If so I'll fill in the relationship manually.

Yes, I assume it being 'restricted' excluded it during the bulk update. Yes we can manually add the isSourceOf relationship URL https://eol.org/resources/459 . Thanks.

@eliagbayani
Collaborator Author

@jhammock
Yes, both this and this are doable. Will create the script and will run each as bulk updates.
Thanks.

@JRice
Member

JRice commented Nov 20, 2024 via email

@eliagbayani
Collaborator Author

@jhammock , question please.
The non-2019 version of the checklists as designed has 'Anne Thessen' as Data Manager.
But the 2019 checklists don't have 'Anne Thessen' as Data Manager anymore, at the moment.
Do we want to add 'Anne Thessen' for 2019 checklists as well?
Thanks.

@jhammock
Collaborator

OK, here's an ideal scenario which may not be practical. The 2019 checklists are derivative works based on the non-2019 checklists. Strictly speaking, each 2019 file should have a derived_from relationship to the non-2019 file for the same country or marine region. I think the titles are all strictly matched (with "2019" appended). Does that make it fairly easy? If so, I think that suffices to document Anne's role. If it's not practical, then yes, we should attach her directly to the 2019 checklists in the contributor field with role=data manager. Thanks!
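
A minimal sketch of that title matching. It assumes the only difference between a 2019 title and its non-2019 counterpart is the token "2019", wherever it sits in the title, and that non-2019 titles are unique. The titles and ids in the example are hypothetical.

```python
import re
from typing import Dict, List


def base_title(title: str) -> str:
    """Remove the '2019' token (and the space before it) from a title."""
    return re.sub(r"\s*\b2019\b", "", title).strip()


def derived_from_pairs(records: List[dict]) -> Dict[int, int]:
    """Map each 2019 record id to the id of the non-2019 record it derives from."""
    older = {r["title"]: r["id"] for r in records if "2019" not in r["title"]}
    pairs = {}
    for record in records:
        if "2019" in record["title"]:
            source_id = older.get(base_title(record["title"]))
            if source_id is not None:
                pairs[record["id"]] = source_id
    return pairs


# Hypothetical titles and ids, for illustration only.
checklists = [
    {"id": 1, "title": "National Checklists: Albania"},
    {"id": 2, "title": "National Checklists 2019: Albania"},
    {"id": 3, "title": "Water Body Checklists 2019: Baltic Sea"},
]
pairs = derived_from_pairs(checklists)
print(pairs)
```

A 2019 record with no matching older title (id 3 here) is left out of the result, so unmatched cases can be listed and handled by hand.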

@eliagbayani
Collaborator Author

Oh adding the derived_from relationship to the 2019 files is indeed a better solution. I will do that instead.
Thanks.

@eliagbayani
Collaborator Author

Updates:
All these records are now un-deprecated. Added 'geography' keyword.

All these records are now tagged as 'deprecated':

Soon all the 2019 records will have an isDerivedFrom relation; the bulk update for this is currently running.
Thanks.

@eliagbayani
Collaborator Author

eliagbayani commented Nov 26, 2024

Updates:

  • All 2019 checklists (National and Water Body) now have an isDerivedFrom relation.
  • All connector-generated resources now have 'Eli Agbayani' as DataManager.
    Please report any records where I should be excluded or included as DataManager,
    because not all cases can be covered by general criteria.

Thanks.

@jhammock
Collaborator

For your next trick, @eliagbayani , we are hoping you can plunder the description field throughout our zenodo collection, and identify doi strings where they are mentioned. There may be one (as in Jen's dozens of tiny one-source resource files) or multiple dois (as in Katja's grander trait compilations) in the description field. We'd love to have these nice standard identifiers in a structured form and field of their own.

For each doi found, could you please generate a related works record, with relation=References, scheme=doi, and resource type=publication? I think you can put the doi string directly into the identifier field in any of the formats we seem to have used, without further modification. The formats are pretty variable, which might make finding the dois an interesting adventure. I've added a (random?) sample below. If you don't find them all, that's not critical; we'll carry on adding them manually for new resources anyway; we're just a bit daunted at the idea of catching up with all the ones that are already mentioned.

https://doi.org/10.5061/dryad.37pvmcvsj
doi:10.5194/essd-5-259-2013
doi:10.5061/dryad.dv1j5
http://datadryad.org/resource/doi:10.5061/dryad.0sd41
https://doi.org/10.1093/icb/15.2.455
https://doi.org/10.3897/zookeys.189.2043
https://doi.org/10.1016/S0003-9365(87)80069-6
https://doi.org/10.3157/0002-8320(2007)133[167:CANGOM]2.0.CO;2
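
A minimal sketch of pulling DOIs like these out of a description. It matches the bare `10.<registrant>/<suffix>` core regardless of the prefix in front of it, trims trailing punctuation and unbalanced closing brackets, and drops repeats case-insensitively. Unlike the suggestion above to keep each string as written, this sketch normalizes to the bare DOI, which is an assumption of convenience. The regular expression is a heuristic, not the official DOI grammar, and the key names in the related-works entries are assumptions.

```python
import re
from typing import Dict, List

# Suffix runs until whitespace, a double quote, or an angle bracket,
# so DOIs inside HTML attributes or tags are cut off cleanly.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>]+')


def _trim(doi: str) -> str:
    """Strip trailing punctuation and unbalanced closing brackets."""
    while doi:
        last = doi[-1]
        if last in ".,;:'":
            doi = doi[:-1]
        elif last == ")" and doi.count(")") > doi.count("("):
            doi = doi[:-1]
        elif last == "]" and doi.count("]") > doi.count("["):
            doi = doi[:-1]
        else:
            break
    return doi


def extract_dois(description: str) -> List[str]:
    seen = set()
    found = []
    for match in DOI_PATTERN.finditer(description):
        doi = _trim(match.group(0))
        if doi and doi.lower() not in seen:
            seen.add(doi.lower())
            found.append(doi)
    return found


def as_related_works(dois: List[str]) -> List[Dict[str, str]]:
    # Key names are assumptions modelled on the request above.
    return [
        {"identifier": d, "relation": "references", "scheme": "doi",
         "resource_type": "publication"}
        for d in dois
    ]


text = (
    "See https://doi.org/10.1080/02724634.1997.10010977 and "
    "DOI:10.1080/02724634.1997.10010977. Also (doi:10.5061/dryad.dv1j5), "
    "http://datadryad.org/resource/doi:10.5061/dryad.0sd41 and "
    "https://doi.org/10.3157/0002-8320(2007)133[167:CANGOM]2.0.CO;2"
)
dois = extract_dois(text)
print(dois)
```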

@eliagbayani
Collaborator Author

eliagbayani commented Nov 29, 2024

@jhammock
In a case like this one, the same DOI is presented in two ways:

[0] => https://doi.org/10.1080/02724634.1997.10010977
[1] => DOI:10.1080/02724634.1997.10010977

Which format do we prefer to save in Related Works?
Thanks.

@eliagbayani
Collaborator Author

@jhammock
First try. Related Works, DOI references now available. For review.
Thanks.

@jhammock
Collaborator

> @jhammock Case like this one, where same DOI was presented in two ways.
>
> [0] => https://doi.org/10.1080/02724634.1997.10010977
> [1] => DOI:10.1080/02724634.1997.10010977
>
> Which format we prefer to save in Related Works? Thanks.

I don't know of a reason to prefer either one- @KatjaSchulz have you a preference?

@jhammock
Collaborator

This draft looks great, @eliagbayani ! I did find one case that didn't receive a relationship. Any ideas? It does have the same doi mentioned twice in exactly the same format, if that could be an issue. (I'll edit the description later, but wanted you to see as is)

@eliagbayani
Collaborator Author

eliagbayani commented Nov 29, 2024

> This draft looks great, @eliagbayani ! I did find one case that didn't receive a relationship. Any ideas? It does have the same doi mentioned twice in exactly the same format, if that could be an issue. (I'll edit the description later, but wanted you to see as is)

Yes, it is weird that this record was excluded in my subset of records with string 'doi' found in description. It is not because of the double entry. Anyway I ran it by itself and it now has DOI in Related Works. Thanks Jen.
