Refactor the BIA retrieval tool #11

kostrykin · 2024-07-18T18:25:25Z

There is a tool for downloading images from the Bioimage Archive:
https://imaging.usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/bgruening/bia_download/bia_download/0.1.0+galaxy0

The UI of this tool needs some love:

The "storage mode" can only be nfs or fire, this should thus be a dropdown field.
The help text of the field should say when to use which of the two options.
The "path of accession" field is unclear — what does a user need to put in here? Add a help text.
Why are both fields optional? Is that correct?
Is it possible to retrieve only part of a dataset?

Optional:

If it is possible to retrieve only part of a dataset, this should be implemented.
Migrate to the alpha version of the API? Stable API could be available by now (Beatriz should have further info).


B0r1sD · 2024-11-04T16:04:16Z

I'm looking into this task.

Tool source: tools/image_processing/bia-ftplinks

Would it be helpful to add this tool to the IUC tool repository, so that we can also make use of their tests and best practices?

B0r1sD · 2024-11-04T17:52:22Z

Started tracking the progress in this draft PR: bgruening/galaxytools#1541

Added EDAM ontology:
- https://edamontology.github.io/edam-browser/#topic_3383
- https://edamontology.github.io/edam-browser/#operation_2422
  As this is my first time using it, a double check from someone would be nice!
This point is not clear to me yet:

The "storage mode" can only be nfs or fire, this should thus be a dropdown field.

The help text of the field should say when to use which of the two options.

but I have found info about it in the following places:

And some context:
FIRE stands for FIle REplication, EMBL-EBI’s very large-scale object data storage system. This provides long-term sustainable storage, operational redundancy, and backup to tape. Dataset level metadata are stored in a MongoDB database. The system backend is coded in Kotlin.

Why are both fields optional? Is that correct?

They are essential, added.

kostrykin · 2024-11-05T08:33:40Z

Thanks @B0r1sD!

This point is not clear to me yet:

The "storage mode" can only be nfs or fire, this should thus be a dropdown field.

Right now "storage mode" is a text field, but according to the help text of the field, only two values are accepted (either nfs or fire). In that case, the field should be a dropdown where one of the two options can be selected, since any other input value would be invalid, and a good UI should prevent invalid input.

The help text of the field should say when to use which of the two options.

To me and @beatrizserrano it wasn't immediately clear how to determine the correct value for this input. I see from your explanations what either of the two is, but still, how is the user supposed to determine the correct value for input here? Can we add a help text here to provide some guidance?

B0r1sD · 2024-11-05T11:35:11Z

Mail sent to BIA to ask about nfs/fire; excerpt from the reply:

To answer your question, NFS is our older storage system and FIRE is the new one. So, we have some files still in NFS while the newer ones are on FIRE. There is not a recommended mode for FTP download, but you will need to use the correct one for each dataset. Sorry if this was unclear.
You can find out what is the storage mode for a dataset using this command: 
curl "https://www.ebi.ac.uk/biostudies/api/v1/studies/S-BIAD570/info" -s | jq -r .ftpLink

Other checklist points addressed and committed to the PR.
Looked into downloading subsets via wget. It should be possible, but I still have to look into how to implement it. Presumably, the user would input a list of subsets and the wrapper would then implement the download script from BIA.
Example of what the downloaded script looks like:

#!/bin/bash
# Run this file in bash with this command:  ./filename
HOST=ftp.ebi.ac.uk
USER=anonymous
ftp -pinv $HOST <<EOF
user $USER
cd biostudies/fire/S-BIAD/458/S-BIAD1458/Files
binary
mget "Red blood cell differential image data/data/0-0.3/0(11).jpg"
mget "Red blood cell differential image data/data/0-0.3/0(2).jpg"
disconnect
bye
EOF
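As a rough sketch of the subset idea above, the wrapper could generate the FTP commands from a user-supplied file list instead of relying on a fixed script. The function name `make_ftp_script` and its arguments are hypothetical; the remote directory is taken as given rather than derived from the accession, and the command sequence simply mirrors the BIA script shown above.

```shell
# Hypothetical helper (not part of the existing wrapper): print the FTP
# commands needed to fetch selected files from one study directory.
#   $1       remote directory, e.g. biostudies/fire/S-BIAD/458/S-BIAD1458/Files
#   $2 ...   file paths relative to that directory
make_ftp_script() {
  remote_dir="$1"
  shift
  printf 'user anonymous\n'
  printf 'cd %s\n' "$remote_dir"
  printf 'binary\n'
  # One mget line per requested file; quoting keeps spaces in names intact.
  for f in "$@"; do
    printf 'mget "%s"\n' "$f"
  done
  printf 'disconnect\nbye\n'
}

# Usage (needs network access and an ftp client):
# make_ftp_script "biostudies/fire/S-BIAD/458/S-BIAD1458/Files" \
#   "Red blood cell differential image data/data/0-0.3/0(2).jpg" \
#   | ftp -pinv ftp.ebi.ac.uk
```

Generating the commands on the fly would let the tool accept any list of files without storing a script per dataset.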

kostrykin · 2024-11-05T13:10:04Z

Mail sent to BIA to ask about nfs/fire; excerpt from the reply:

To answer your question, NFS is our older storage system and FIRE is the new one. So, we have some files still in NFS while the newer ones are on FIRE. There is not a recommended mode for FTP download, but you will need to use the correct one for each dataset. Sorry if this was unclear.
You can find out what is the storage mode for a dataset using this command:
curl "https://www.ebi.ac.uk/biostudies/api/v1/studies/S-BIAD570/info" -s | jq -r .ftpLink

Cool, can we use the curl command in the tool wrapper to determine the correct mode automatically?

B0r1sD · 2024-11-05T15:54:24Z

Yeah that would be ideal.

kostrykin · 2024-11-05T20:21:29Z

Yeah that would be ideal.

Let me know if you need any help!

kostrykin · 2024-11-05T20:27:47Z

@B0r1sD What's your current state? We need to report our state tomorrow. It would be ideal if you could tick the boxes! 🥳

B0r1sD · 2024-11-07T11:17:05Z

Current state: we're in talks with folks from BIA to add a button on their website to seamlessly integrate a data retrieval method similar to some 'Get Data' tools (UCSC, EBI SRA,...).

In the meantime, I got an answer about FIRE/NFS:

Right now all our studies have been migrated to FIRE storage. However, we are introducing a new feature that will use NFS as a storage option again. This will mean that soon we’ll have data on NFS and FIRE, so you probably want to keep that in mind.

So we decided to keep the dropdown, make FIRE the default option, and thoroughly explain why there are two options (and when to choose which).
In the future, when they start using NFS again, we can look into integrating the curl + jq command that checks whether a study is on FIRE or NFS. This command worked:

curl "https://www.ebi.ac.uk/biostudies/api/v1/studies/S-BIAD570/info" -s | jq '.. | .ftpLink? // empty'

However, it makes use of their API, which is still in alpha.

Via the FTP link, the nfs/fire information is not directly included (it could be worked out later if we change the wrapper).

curl "https://ftp.ebi.ac.uk/biostudies/fire/S-BIAD/570/S-BIAD570/" -s
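A minimal sketch of how the wrapper might derive the storage mode from the `ftpLink` value returned by the curl + jq command above. It assumes the link contains a `/biostudies/fire/` or `/biostudies/nfs/` path segment, as the FTP paths quoted in this thread suggest; that format is not confirmed by the BIA, and the function name `storage_mode_from_ftplink` is hypothetical.

```shell
# Hypothetical helper: map an ftpLink string to "fire" or "nfs".
# Assumption: the storage mode appears as a path segment after /biostudies/.
storage_mode_from_ftplink() {
  case "$1" in
    */biostudies/fire/*) echo fire ;;
    */biostudies/nfs/*)  echo nfs ;;
    *)
      # Unknown layout: report it and let the caller fall back to a default.
      echo "unknown storage mode in: $1" >&2
      return 1
      ;;
  esac
}

# Usage (needs network access, curl and jq; relies on the alpha API):
# link="$(curl -s "https://www.ebi.ac.uk/biostudies/api/v1/studies/S-BIAD570/info" | jq -r '.ftpLink')"
# storage_mode_from_ftplink "$link"
```

Keeping the parsing separate from the network call means the dropdown could stay as a manual override if the API changes.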

B0r1sD · 2024-11-07T11:24:33Z

Information communicated to the BIA:

Technical details

General flow

User starts at Galaxy, gets sent to external resource with 'GALAXY_URL' parameter.
User browses external site and selects options, sends data to Galaxy by using the GALAXY_URL parameter, providing with it a URL parameter that contains where Galaxy should inform the external site of the final GALAXY_URL
Galaxy contacts 'URL', with a new GALAXY_URL (the page content of accessing 'URL' should end with 'OK')
When data is ready, the external site contacts the new GALAXY_URL, providing 'URL' which contains final data and 'STATUS' which should be 'OK' (when successful)
Data is loaded into the Galaxy history.

Depending on how the data is fetched at your end, the depositing of data should be either implemented synchronously (docs) or asynchronously (docs). The synchronous implementation is less complex but depending on your backend, this could simply not be an option. @wm75 is an expert in this implementation so can provide technical support where needed.

Code implementation example(s)

The following GitHub repo contains example scripts for the implementation in three different Python web frameworks (CherryPy, Django, Flask): https://github.com/hexylena/galaxy-data_source-examples
The lines of code Björn was referring to would look like this in CherryPy: https://github.com/hexylena/galaxy-data_source-examples/blob/main/cherrypy/server.py. This example also comes with documentation: https://github.com/hexylena/galaxy-data_source-examples/tree/main/cherrypy#overview.

Examples

Below is an example of how this feature was implemented by the UCSC for their Table Browser, from both perspectives.

Data(base) side

Below, two examples of active implementations are shown, which is the relevant perspective for your team.

-UCSC Table Browser-

Search for the UCSC Main tool at e.g. https://usegalaxy.eu/

The tool redirects to the following link, where you can see the GALAXY_URL parameter:
https://genome.ucsc.edu/cgi-bin/hgTables?GALAXY_URL=https%3A//usegalaxy.eu/tool_runner&tool_id=ucsc_table_direct1&sendToGalaxy=1&hgta_compressType=none&hgta_outputType=bed
Here you can see how UCSC implemented the option buttons on their webpage.

-EBI SRA-

A video (from 2015) showing the workflow and how EBI implemented this on their side for the European Short Read Archive:

https://vimeo.com/121187220

https://usegalaxy.eu/tool_runner/data_source_redirect?tool_id=ebi_sra_main

Galaxy side

This is what the Galaxy tool (or 'wrapper') would look like on Galaxy's side. XML file example for the UCSC Table Browser: https://github.com/galaxyproject/galaxy/blob/dev/tools/data_source/ucsc_tablebrowser.xml. More technical information on this 'data source' type of tool can be found here: https://docs.galaxyproject.org/en/latest/dev/data_source.html. This is something we would develop and provide.

B0r1sD · 2024-11-07T11:37:54Z

The retrieval tool also only works for studies that are part of the 'BioImages - Core' collection (with an accession that looks like S-BIAD0000). This is not the only study collection on there, so I will document this in the wrapper for now and see how the seamless integration button progresses, as that would make this tool obsolete. For that reason, I don't see the point right now in implementing an error catch or a feature that works with all types of studies (e.g. S-JCBD-201709074).
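Should an error catch become worthwhile later, a check along these lines might be enough. It is only a sketch: it assumes a supported accession is exactly `S-BIAD` followed by one or more digits, which is inferred from the S-BIAD0000 pattern mentioned above, and the function name `is_core_accession` is hypothetical.

```shell
# Hypothetical check: succeed only for accessions of the form S-BIAD<digits>.
is_core_accession() {
  digits="${1#S-BIAD}"
  # If stripping the prefix changed nothing, the prefix was missing.
  [ "$digits" != "$1" ] || return 1
  case "$digits" in
    ''|*[!0-9]*) return 1 ;;  # empty or contains a non-digit character
    *) return 0 ;;
  esac
}

# Usage:
# is_core_accession "S-BIAD570" && echo "supported"
# is_core_accession "S-JCBD-201709074" || echo "not supported by this tool"
```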

B0r1sD · 2024-11-08T09:25:25Z

I'm having some issues serving the tool locally. The last change is a more verbose help section, which I will add later:

Storage mode:
FIle REplication or FIRE is EMBL-EBI’s very large-scale object data storage system. At the time of writing, all their studies have been migrated to FIRE storage, which is why it is the default option. However, they are introducing a new feature that will use NFS as a storage option, so the study you are referring to might live on NFS in the near future. This is the reason both options are available.
Accession number:
This tool only supports studies that are part of the 'BioImages - Core' collection (with an accession number that follows the S-BIAD0000 pattern).

kostrykin added this to BH2024 Jul 18, 2024

kostrykin converted this from a draft issue Jul 18, 2024

kostrykin added the tools dev This issue involves tool development label Jul 18, 2024

beatrizserrano mentioned this issue Oct 3, 2024

Create workflow for AI-based image analysis #3

Open

4 tasks

kostrykin assigned B0r1sD Nov 4, 2024

B0r1sD linked a pull request Nov 4, 2024 that will close this issue

[BH2024] Update Bioimaging Archive retrieval tool bgruening/galaxytools#1541

Draft

kostrykin mentioned this issue Nov 6, 2024

Load images from the BIA (GTN) #13

Open

kostrykin added the BHEU24 Has been worked on at the BHEU2024 label Nov 8, 2024
