Many changes #149

Merged
merged 13 commits into from
Jul 11, 2024
6 changes: 3 additions & 3 deletions .github/workflows/fetch_all_tools.yaml
@@ -26,7 +26,7 @@ jobs:
run: python -m pip install -r requirements.txt
- name: Run script
run: |
python ./bin/get_public_galaxy_servers.py -o data/available_public_servers.csv
python bin/get_public_galaxy_servers.py -o data/available_public_servers.csv
- name: Commit servers
# add or commit any changes in results if there was a change, merge with main, and push as bot
run: |
@@ -59,7 +59,7 @@ jobs:
run: python -m pip install -r requirements.txt
- name: Run script #needs PAT to access other repos
run: |
bash ./bin/extract_all_tools_stepwise.sh "${{ matrix.subset }}"
bash bin/extract_all_tools.sh "${{ matrix.subset }}"
env:
GITHUB_API_KEY: ${{ secrets.GH_API_TOKEN }}
- name: Commit all tools
@@ -92,7 +92,7 @@ jobs:
jq -s 'map(.[])' results/repositories*.list_tools.json > results/all_tools.json
- name: Wordcloud and interactive table
run: |
bash ./bin/extract_all_tools_downstream.sh
bash bin/format_tools.sh
- name: Commit all tools
# add or commit any changes in results if there was a change, merge with main and push as bot
run: |
2 changes: 1 addition & 1 deletion .github/workflows/fetch_all_tutorials.yaml
@@ -29,7 +29,7 @@ jobs:
run: python -m pip install -r requirements.txt
- name: Run script #needs PAT to access other repos
run: |
bash ./bin/extract_all_tutorials.sh
bash bin/extract_all_tutorials.sh
env:
PLAUSIBLE_API_KEY: ${{ secrets.PLAUSIBLE_API_TOKEN }}
- name: Commit all tools
6 changes: 3 additions & 3 deletions .github/workflows/filter_communities.yaml
@@ -36,7 +36,7 @@ jobs:
run: python -m pip install -r requirements.txt
- name: Run script
run: |
bash ./bin/get_community_tutorials.sh
bash bin/get_community_tutorials.sh
- name: Commit results
# commit the new filtered data, only if stuff was changed
run: |
@@ -59,7 +59,7 @@ jobs:
run: python -m pip install -r requirements.txt
- name: Run script
run: |
bash ./bin/update_tools_to_keep_exclude.sh
bash bin/update_tools_to_keep_exclude.sh
- name: Commit results
# commit the new filtered data, only if stuff was changed
run: |
@@ -82,7 +82,7 @@ jobs:
run: python -m pip install -r requirements.txt
- name: Run script
run: |
bash ./bin/get_community_tools.sh
bash bin/get_community_tools.sh
- name: Commit results
# commit the new filtered data, only if stuff was changed
run: |
44 changes: 6 additions & 38 deletions .github/workflows/run_tests.yaml
@@ -13,39 +13,16 @@ jobs:
- name: Install requirement
run: python -m pip install -r requirements.txt
- name: Tool extraction
# run: bash bin/extract_all_tools.sh
run: |
python bin/extract_galaxy_tools.py \
extractools \
--api $GITHUB_API_KEY \
--all-tools "results/test_tools.tsv" \
--all-tools-json "results/test_tools.json" \
--planemo-repository-list "test.list" \
--test
bash bin/extract_all_tools.sh test
Collaborator:

Why did you switch back to the bash script? Does `set -e` make it stop on a Python error?

Member Author:

To have only one location to modify when we modify the Python script and avoid forgetting
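
For context on the `set -e` question, the following is a minimal sketch and not part of this pull request. It assumes `bash` is available on the `PATH`, and the helper name `run_script` is hypothetical. It runs a two-line shell script whose first command is a Python process that exits with a non-zero status, once with `set -e` and once without, to show whether the second command is still reached.

```python
import subprocess
import sys


def run_script(body: str) -> subprocess.CompletedProcess:
    """Run a small bash script (hypothetical helper) and capture its output."""
    return subprocess.run(["bash", "-c", body], capture_output=True, text=True)


# A Python command that fails with exit status 3.
failing = f'"{sys.executable}" -c "import sys; sys.exit(3)"'

# With `set -e`, bash stops at the first failing command.
with_set_e = run_script(f"set -e\n{failing}\necho reached")

# Without it, bash carries on and reports the status of the last command.
without_set_e = run_script(f"{failing}\necho reached")

print(with_set_e.returncode, repr(with_set_e.stdout))
print(without_set_e.returncode, repr(without_set_e.stdout))
```

Whether the scripts in `bin/` actually set this option is not shown in this diff.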

env:
GITHUB_API_KEY: ${{ secrets.GH_API_TOKEN }}
- name: Tool filter
run: |
python bin/extract_galaxy_tools.py \
filtertools \
--tools "results/all_tools.json" \
--ts-filtered-tools "results/microgalaxy/tools_filtered_by_ts_categories.tsv" \
--filtered-tools "results/microgalaxy/tools.tsv" \
--categories "data/communities/microgalaxy/categories" \
--status "data/communities/microgalaxy/tool_status.tsv"
- name: Create interactive table
bash bin/get_community_tools.sh test
- name: Create interactive table and wordcloud
run: |
python bin/create_interactive_table.py \
--table "results/microgalaxy/tools.tsv" \
--template "data/interactive_table_template.html" \
--output "results/microgalaxy/index.html"
- name: Create wordcloud
run: |
python bin/create_wordcloud.py \
--table "results/microgalaxy/tools.tsv" \
--wordcloud_mask "data/usage_stats/wordcloud_mask.png" \
--output "results/microgalaxy/tools_wordcloud.png" \
--stats_column "No. of tool users (2022-2023) (usegalaxy.eu)"
bash bin/format_tools.sh
test-tutorials:
runs-on: ubuntu-20.04
steps:
@@ -57,18 +34,9 @@ jobs:
run: python -m pip install -r requirements.txt
- name: Tutorial extraction
run: |
python bin/extract_gtn_tutorials.py \
extracttutorials \
--all_tutorials "results/test_tutorials.json" \
--tools "results/all_tools.json" \
--api $PLAUSIBLE_API_KEY \
--test
bash bin/extract_all_tutorials.sh test
env:
PLAUSIBLE_API_KEY: ${{ secrets.PLAUSIBLE_API_TOKEN }}
- name: Tutorial filtering
run: |
python bin/extract_gtn_tutorials.py \
filtertutorials \
--all_tutorials "results/test_tutorials.json" \
--filtered_tutorials "results/microgalaxy/test_tutorials.tsv" \
--tags "data/communities/microgalaxy/tutorial_tags"
bash bin/get_community_tutorials.sh test
4 changes: 3 additions & 1 deletion .gitignore
@@ -1,2 +1,4 @@
.DS_Store
__pycache__
__pycache__
results/test*
results/*/test*
184 changes: 132 additions & 52 deletions README.md
@@ -1,42 +1,13 @@
Galaxy Tool Metadata Extractor
=====================
Galaxy Codex
============

# What is the tool doing?
Galaxy Communities Dock aka Galaxy Codex is a catalog of Galaxy resources (tools, training, workflows) that can be filtered for any community.

![plot](docs/images/Preprint_flowchart.png)


This tool automatically collects a table of all available Galaxy tools including their metadata. Therefore, various sources are parsed to collect the metadata, such as:
* github (parsing each tool wrapper)
* bio.tools
* bioconda
* Galaxy instances (availability, statistics)
This repository stores the sources to build this catalog. The catalog is automatically updated every week.

The created table can be filtered to only show the tools relevant for a specific community.

Any Galaxy community can be added to this project and benefit from a dedicated interactive table that can be embedded into subdomains and websites via an iframe. **Learn [how to add your community](https://training.galaxyproject.org/training-material//topics/dev/tutorials/community-tool-table/tutorial.html) in the dedicated GTN tutorial**.

The interactive table benefits from EDAM annotations of the tools, which requires that the tools are annotated via bio.tools.
**Learn [how to improve metadata for Galaxy tools using the bio.tools registry](https://training.galaxyproject.org/training-material//topics/dev/tutorials/tool-annotation/tutorial.html)**.
Any Galaxy community can be added to this project and benefit from the dedicated resources, including interactive tables that can be embedded into subdomains and websites via an iframe. **Learn [how to add your community](https://training.galaxyproject.org/training-material//topics/dev/tutorials/community-tool-table/tutorial.html) in the dedicated GTN tutorial**.

# Tool workflows

The tool performs the following steps:

- Parse tool GitHub repository from [Planemo monitor listed](https://github.com/galaxyproject/planemo-monitor)
- Check in each repo, their `.shed.yaml` file and filter for categories, such as metagenomics
- Extract metadata from the `.shed.yaml`
- Extract the requirements in the macros or xml to get version supported in Galaxy
- Check available against conda version
- Extract bio.tools information if available in the macros or xml
- Check available on the 3 main galaxy instances (usegalaxy.eu, usegalaxy.org, usegalaxy.org.au)
- Get usage statistics from usegalaxy.eu
- Creates an interactive table for all tools: [All tools](https://galaxyproject.github.io/galaxy_tool_metadata_extractor/)
- Creates an interactive table for all registered communities, e.g. [microGalaxy](https://galaxyproject.github.io/galaxy_tool_metadata_extractor/microgalaxy/)

# Usage

## Prepare environment
# Prepare environment
Collaborator:

I would add this below "Extract all tools outside a GitHub Action". Any objection to using conda instead of virtualenv?

Member Author:

> Any objection to using conda instead of virtualenv?

We could document both virtualenv and conda. I usually favor virtualenv for Python-only projects (even if I use it within a conda env 😅), because people might prefer to avoid conda.

Member Author:

> I would add this below "Extract all tools outside a GitHub Action".

I get the point. The only thing is that it is also useful for Training.


- Install virtualenv (if not already there)

@@ -62,9 +33,39 @@ The tool performs the following steps:
$ python3 -m pip install -r requirements.txt
```

## Tools
# Extract Galaxy Tool Suites

### Extract all tools
![plot](docs/images/Preprint_flowchart.png)

This tool automatically collects a table of all available Galaxy tool suites, including their metadata. To collect the metadata, it parses various sources, such as:
* GitHub (parsing each tool wrapper)
* bio.tools
* Bioconda
* Galaxy server (availability, statistics)

The created table can be filtered to only show the tools relevant for a specific community.

The tool table benefits from EDAM annotations of the tools, which requires that the tools are annotated via bio.tools.
**Learn [how to improve metadata for Galaxy tools using the bio.tools registry](https://training.galaxyproject.org/training-material//topics/dev/tutorials/tool-annotation/tutorial.html)**.

## Extract tool suites and filter per community automatically

A GitHub Action performs the following steps every week:

- Extract all tools by
1. Parsing the tool GitHub repositories listed in [Planemo monitor](https://github.com/galaxyproject/planemo-monitor)
2. Checking the `.shed.yaml` file of each repository and filtering for categories, such as metagenomics
3. Extracting metadata from the `.shed.yaml`
4. Extracting the requirements from the macros or XML to get the version supported in Galaxy
5. Checking the available version against the Conda version
6. Extracting bio.tools information if available in the macros or XML
7. Checking availability on the 3 main Galaxy instances (usegalaxy.eu, usegalaxy.org, usegalaxy.org.au)
8. Getting usage statistics from usegalaxy.eu
- Create an interactive table for all tools: [All tools](https://galaxyproject.github.io/galaxy_tool_metadata_extractor/)
- Filter the tool suites per community
- Create an interactive table for all registered communities, e.g. [microGalaxy](https://galaxyproject.github.io/galaxy_tool_metadata_extractor/microgalaxy/)

## Extract all tools outside a GitHub Action

1. Get an API key ([personal token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens)) for GitHub
2. Export the GitHub API key as an environment variable:
@@ -73,7 +74,7 @@ The tool performs the following steps:
$ export GITHUB_API_KEY=<your GitHub API key>
```

3. Run the script
3. Run the script to extract all tools

```
$ bash bin/extract_all_tools.sh
@@ -98,7 +99,7 @@ The script will generate a TSV file with each tool found in the list of GitHub r
15. Conda id
16. Conda version

### Filter tools based on their categories in the ToolShed
## Filter tools based on their categories in the ToolShed outside a GitHub Action

1. Run the extraction as explained before
2. (Optional) Create a text file with ToolShed categories for which tools need to be extracted: 1 ToolShed category per row ([example for microbial data analysis](data/microgalaxy/categories))
@@ -113,16 +114,33 @@ The script will generate a TSV file with each tool found in the list of GitHub r

```
$ python bin/extract_galaxy_tools.py \
--tools <Path to JSON file with all extracted tools> \
--ts-filtered-tools <Path to output TSV with tools filtered based on ToolShed category>
--filtered-tools <Path to output TSV with filtered tools based on ToolShed category and manual curation> \
filter \
--all <Path to JSON file with all extracted tools> \
--ts-filtered <Path to output TSV with tools filtered based on ToolShed category> \
--filtered <Path to output TSV with filtered tools based on ToolShed category and manual curation> \
[--categories <Path to ToolShed category file>] \
[--status <Path to a TSV file with tool status - 3 columns: ToolShed ids of tool suites, Boolean with True to keep and False to exclude, Boolean with True if deprecated and False if not>]
```
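
The two filtering stages described above (ToolShed categories first, then the manually curated status) can be pictured with the following minimal sketch. It is not the logic of `bin/extract_galaxy_tools.py`: the function name `filter_tools` and the dictionary keys `Suite ID` and `ToolShed categories` are assumptions made for illustration, and the deprecation flag of the status file is carried along but not used.

```python
def filter_tools(tools, categories, status=None):
    """Return (ts_filtered, curated) lists of tool suites.

    tools: list of dicts, one per tool suite (hypothetical keys)
    categories: ToolShed categories to keep; an empty list keeps everything
    status: optional dict mapping suite id -> (to_keep, deprecated) booleans
    """
    wanted = {c.strip().lower() for c in categories if c.strip()}
    # Stage 1: keep suites sharing at least one category with the wanted set.
    ts_filtered = [
        tool for tool in tools
        if not wanted
        or wanted & {c.lower() for c in tool["ToolShed categories"]}
    ]
    if status is None:
        return ts_filtered, ts_filtered
    # Stage 2: keep only suites explicitly marked as "to keep".
    curated = [
        tool for tool in ts_filtered
        if status.get(tool["Suite ID"], (False, False))[0]
    ]
    return ts_filtered, curated


# Made-up example data, for illustration only.
tools = [
    {"Suite ID": "suite_a", "ToolShed categories": ["Metagenomics"]},
    {"Suite ID": "suite_b", "ToolShed categories": ["Imaging"]},
    {"Suite ID": "suite_c", "ToolShed categories": ["Metagenomics", "Assembly"]},
]
status = {"suite_a": (True, False), "suite_c": (False, True)}
ts_filtered, curated = filter_tools(tools, ["Metagenomics"], status)
print([t["Suite ID"] for t in ts_filtered])
print([t["Suite ID"] for t in curated])
```

In this sketch, suites missing from the status table are excluded from the curated list. That is a design choice of the example, not a documented behavior of the script.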

## Training
# Training

Materials are extracted from the Galaxy Training Network and extended with information from Plausible (visits), YouTube (views), feedback and tools.

## Extract training material and filter per community automatically

### Extract tutorials from GTN
A GitHub Action performs the following steps every week:

- Extract all training by
1. Parsing the GTN API
2. Adding EDAM operations from the tools used in the tutorial
3. Adding visit stats using the Plausible API
4. Adding video view stats using YouTube API
5. Adding feedback from the GTN API
- Create an interactive table for all tutorials
- Filter the training per community based on tags
- Create an interactive table for all registered communities

## Extract tutorials from GTN outside a GitHub Action

1. Get an API key ([personal token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens)) for Plausible
2. Export the Plausible API key as an environment variable:
@@ -137,26 +155,88 @@ The script will generate a TSV file with each tool found in the list of GitHub r
$ bash bin/extract_all_tutorials.sh
```

### Filter tutorials based on tags
## Filter tutorials based on tags outside a GitHub Action

1. Run the extraction as explained before
2. Create a file named `tutorial_tags` in your community `data` folder with the list of tutorial tags to keep
3. Run the following command

```
$ python bin/extract_gtn_tutorials.py \
filtertutorials \
--all_tutorials "results/all_tutorials.json" \
--filtered_tutorials "results/<your community>/tutorials.tsv" \
filter \
--all "results/all_tutorials.json" \
--filtered "results/<your community>/tutorials.tsv" \
--tags "data/communities/<your community>/tutorial_tags"
```
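
Filtering tutorials by tags comes down to keeping every tutorial that carries at least one of the community tags. The sketch below illustrates that idea only. The function name `filter_tutorials` and the keys `title` and `tags` are assumptions, not the actual structure of `results/all_tutorials.json`.

```python
def filter_tutorials(tutorials, tags):
    """Keep tutorials that share at least one tag with the community tag list."""
    wanted = {t.strip().lower() for t in tags if t.strip()}
    return [
        tuto for tuto in tutorials
        if wanted & {t.lower() for t in tuto.get("tags", [])}
    ]


# Made-up example data, for illustration only.
tutorials = [
    {"title": "Tutorial A", "tags": ["microgalaxy", "amr"]},
    {"title": "Tutorial B", "tags": ["plants"]},
    {"title": "Tutorial C"},  # no tags at all
]
# One tag per line, as in a `tutorial_tags` file.
community_tags = "microgalaxy\nmetagenomics\n".splitlines()
kept = filter_tutorials(tutorials, community_tags)
print([t["title"] for t in kept])
```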

## Development
## Export

To make a test run of the tool to check its functionalities, follow [Usage](#Usage) to set up the environment and the API key, then run
### Generate wordcloud

Example to generate a wordcloud of the Galaxy tool suites, in which the size of each tool suite name depends on the number of tool users in 2022-2023 on usegalaxy.eu:

![](results/all_tools_wordcloud.png)

```bash
$ python bin/create_wordcloud.py \
--table "results/all_tools.tsv" \
--name_col "Galaxy wrapper id" \
--stat_col "No. of tool users (2022-2023) (usegalaxy.eu)" \
--wordcloud_mask "data/usage_stats/wordcloud_mask.png" \
--output "results/all_tools_wordcloud.png"
```
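
To make the roles of `--name_col` and `--stat_col` concrete, here is a minimal sketch of how a name-to-weight mapping could be read from the TSV table. It is an illustration under assumptions, not the code of `bin/create_wordcloud.py`: the function name `read_frequencies` is hypothetical, and skipping rows with a missing or non-positive value is a choice made for this example.

```python
import csv
import io


def read_frequencies(tsv_file, name_col, stat_col):
    """Map each name to its numeric statistic, skipping unusable rows."""
    freqs = {}
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        try:
            value = float(row[stat_col])
        except (TypeError, ValueError):
            continue  # empty or non-numeric statistic
        if value > 0:
            freqs[row[name_col]] = value
    return freqs


# Made-up table content, for illustration only.
example = io.StringIO(
    "Galaxy wrapper id\tNo. of tool users (2022-2023) (usegalaxy.eu)\n"
    "suite_a\t120\n"
    "suite_b\t\n"
    "suite_c\t45\n"
)
frequencies = read_frequencies(
    example,
    "Galaxy wrapper id",
    "No. of tool users (2022-2023) (usegalaxy.eu)",
)
print(frequencies)
```

A dictionary of this shape is what the `wordcloud` package's `WordCloud.generate_from_frequencies` method accepts. Whether the script uses that method is not stated in this diff.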

### Create interactive table in HTML

Example to generate an HTML file with an interactive table of the microGalaxy tools that should be kept (`True` in the `To keep` column):

```bash
bash ./bin/extract_all_tools_test.sh test.list
$ python bin/create_interactive_table.py \
--table "results/microgalaxy/tools.tsv" \
--remove-col "Reviewed" \
--remove-col "To keep" \
--filter-col "To keep" \
--template "data/interactive_table_template.html" \
--output "results/microgalaxy/index.html"
```
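
The combination of `--filter-col` and `--remove-col` can be read as "keep the rows where the filter column is `True`, then drop the listed columns before rendering". The sketch below illustrates that reading only. The function name `prepare_rows` is hypothetical and the exact semantics of `bin/create_interactive_table.py` are an assumption.

```python
import csv
import io


def prepare_rows(tsv_file, filter_cols, remove_cols):
    """Keep rows that are "True" in every filter column, then drop columns."""
    rows = []
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        if all(row[col] == "True" for col in filter_cols):
            rows.append({k: v for k, v in row.items() if k not in remove_cols})
    return rows


# Made-up table content, for illustration only.
example = io.StringIO(
    "Galaxy wrapper id\tReviewed\tTo keep\n"
    "suite_a\tTrue\tTrue\n"
    "suite_b\tTrue\tFalse\n"
)
rows = prepare_rows(example, ["To keep"], ["Reviewed", "To keep"])
print(rows)
```

Filtering happens before the columns are removed, which is why `To keep` can appear in both options.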

This runs the tool, but only parses the test repository [Galaxy-Tool-Metadata-Extractor-Test-Wrapper](https://github.com/paulzierep/Galaxy-Tool-Metadata-Extractor-Test-Wrapper)
## Development

### Tools

To make a test run of the tool to check its functionalities, follow [Usage](#Usage) to set up the environment and the API key, then run

1. Tool extraction

```bash
$ bash bin/extract_all_tools.sh test
```

This runs the tool, but only parses the test repository [Galaxy-Tool-Metadata-Extractor-Test-Wrapper](https://github.com/paulzierep/Galaxy-Tool-Metadata-Extractor-Test-Wrapper)

2. Tool filter

```bash
$ bash bin/get_community_tools.sh test
```

3. Create interactive table and wordcloud

```bash
$ bash bin/format_tools.sh
```

### Tutorials

1. Tutorial extraction

```bash
$ bash bin/extract_all_tutorials.sh test
```

2. Tutorial filtering

```bash
$ bash bin/get_community_tutorials.sh test
```
