Adding ISA-JSON support as input file. #88

Merged on Dec 18, 2023 (63 commits)
Changes from 61 commits
Commits
97d7fda
Initial push of the ISA JSON parser app
Sep 22, 2023
572c49b
Updated isa_study.py:
Sep 22, 2023
72e3c56
Update test file.
Sep 22, 2023
511e0f7
Add method to dump object in a pandas DataFrame and renamed to Ena Ob…
Sep 22, 2023
89675d6
Update EnaStudy:
Sep 22, 2023
8ec21fd
Add other Classes
Sep 24, 2023
057df1f
Replaced test ISA Json by one without pooled samples
Sep 25, 2023
8838c9e
Add additional objects
Sep 26, 2023
0137052
Updated validation
Sep 26, 2023
6493e4e
Implementation of the Ena Samples functionality
Sep 26, 2023
88de533
remove init file
Sep 26, 2023
b1f4c94
Add documentation
Sep 26, 2023
dc2d2ed
- Implement Submission wrapper object
Sep 27, 2023
2ff3ca1
Add Test Class for Ena Study
Sep 27, 2023
1a5cbff
Renamed read-isa-json folder
Sep 27, 2023
60fd7e1
Add init file to the tests folder
Sep 27, 2023
15702ec
Fix failing tests
Sep 27, 2023
76c8ae6
Rename test file to accomodate other class objects
Sep 27, 2023
138c596
Add test for reading an ISA JSON and producing studies
Sep 27, 2023
eb9eed3
Delete read_isa_json folder and move example script
Sep 27, 2023
b0cd314
Worked on parsing experiment data for ENA submissions
Sep 27, 2023
fa4f66d
Implement exporting to dataframe
Sep 28, 2023
479c95a
Change alias prefix of experiments to the samples url. Makes more sense?
Sep 28, 2023
1645248
Get sample alias out of the list
Sep 28, 2023
f19d07d
Implementation of Ena Runs
Sep 28, 2023
2baec94
Implement Submission
Sep 28, 2023
3426dcf
Cleaning up classes
Sep 29, 2023
3a4b4ef
Implement Characteristic class for Ena Sample
Sep 29, 2023
6e8c400
Clean up example script
Sep 29, 2023
766246a
Move clip_off_prefix to the common ena_std_lib module.
Sep 29, 2023
9e009bd
Annotation of the classes and modules
Sep 29, 2023
27abbde
Remove unused imports
Sep 29, 2023
1839994
Fix typo
Oct 3, 2023
40ccb57
Implementation of assay streams for ena runs
Oct 11, 2023
57b04a3
Rearranged EnaSubmission
Oct 11, 2023
eb10676
Prefix is fetched from custom metadata
Oct 12, 2023
ce88034
Sanitize samples + fix sample_alias in experiments
Oct 13, 2023
a5431e3
clean up
Oct 13, 2023
a590710
Replace simple dictionary validation by extensive JSON schema validation
Oct 13, 2023
650a4b1
Replace script by jupyter notebook
Oct 13, 2023
62dd1df
Restructure isa json support (#1)
bedroesb Oct 20, 2023
5e29034
specify pytest version
bedroesb Oct 20, 2023
fcd6fa7
Upload simple test case ISA JSON
Oct 20, 2023
13ac7d2
Fix simple test case
Oct 20, 2023
25462ed
Fix experiment alias in runs table
Oct 20, 2023
298d4eb
Make run alias the process id instead of sample id for sequencing dat…
Nov 8, 2023
5883f7c
Fixed typo in example data
Nov 8, 2023
46b9afe
Adapt `get_parameter_values` for multi-output process
Nov 8, 2023
e9b913d
Move `get_parameter_values` and `fetch_parameters` to shared ena_std_lib
Nov 8, 2023
cafcaad
Implementation of ParameterValues for samples
Nov 8, 2023
6e5e393
Replace NaN in dataframes by empty string
Nov 9, 2023
addc64d
Remove example python notebook
kdp-cloud Nov 9, 2023
02a6cb8
attempt to fix the setup.py
bedroesb Nov 16, 2023
f40f51d
raise error when assay stream is not present
bedroesb Nov 16, 2023
7fcc93b
some typos
bedroesb Nov 16, 2023
bf3b20c
no receipt
bedroesb Nov 16, 2023
c16bf00
Merge branch 'master' of github.com:usegalaxy-eu/ena-upload-cli into dev
bedroesb Nov 20, 2023
81c601e
new version
bedroesb Nov 20, 2023
3a0f27d
update documentation
bedroesb Nov 20, 2023
574b857
doc
bedroesb Dec 15, 2023
2a4d9a8
add isa_json
bedroesb Dec 15, 2023
c6503de
use with statement
bedroesb Dec 18, 2023
abf7b73
fix example
bedroesb Dec 18, 2023
2 changes: 1 addition & 1 deletion .gitignore
@@ -2,4 +2,4 @@
.secret.yml
build/
ena_upload_cli.egg-info/
ena_upload/__pycache__/
__pycache__/
77 changes: 22 additions & 55 deletions README.md
@@ -7,18 +7,18 @@

# ENA upload tool

This command line tool (CLI) allows easy submission of experimental data and respective metadata to the European Nucleotide Archive (ENA) using tabular files or one of the excel spreadsheets that can be found on this [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates). The supported metadata that can be submitted includes study, sample, run and experiment info so you can use the tool for programatic submission of everything ENA needs without the need of logging in to the Webin interface. This also includes client side validation using ENA checklists and releasing the ENA objects. This command line tool is also available as a [Galaxy tool](https://toolshed.g2.bx.psu.edu/view/iuc/ena_upload/) and can be added to you own Galaxy instance or you can make use of one of the existing Galaxy instances, like [usegalaxy.eu](https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/ena_upload/ena_upload).
This command line tool (CLI) allows easy submission of experimental data and respective metadata to the European Nucleotide Archive (ENA) using tabular files or one of the excel spreadsheets that can be found on this [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates). The supported metadata that can be submitted includes study, sample, run and experiment info so you can use the tool for programmatic submission of everything ENA needs without the need of logging in to the Webin interface. This also includes client side validation using ENA checklists and releasing the ENA objects. This command line tool is also available as a [Galaxy tool](https://toolshed.g2.bx.psu.edu/view/iuc/ena_upload/) and can be added to you own Galaxy instance or you can make use of one of the existing Galaxy instances, like [usegalaxy.eu](https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/ena_upload/ena_upload).

## Overview

The metadata should be provided in separate tables corresponding to the following ENA objects:
The metadata should be provided in separate tables or files carrying similar information corresponding to the following ENA objects:

* STUDY
* SAMPLE
* EXPERIMENT
* RUN

The program to perform the following actions:
You can set the tool to perform the following actions:

* add: add an object to the archive
* modify: modify an object in the archive
@@ -29,11 +29,15 @@ After a successful submission, new tsv tables will be generated with the ENA acc

## Tool dependencies

* python 3.5+ including following packages:
* python 3.7+ including following packages:
* Genshi
* lxml
* pandas
* requests
* pyyaml
* openpyxl
* jsonschema


## Installation

@@ -60,12 +64,14 @@ All supported arguments:
--experiment EXPERIMENT
table of EXPERIMENT object
--run RUN table of RUN object
--data [FILE [FILE ...]]
data for submission
--data [FILE ...] data for submission
--center CENTER_NAME specific to your Webin account
--checklist CHECKLIST
specify the sample checklist with following pattern: ERC0000XX, Default: ERC000011
--xlsx XLSX filled in excel template with metadata
--isa_json ISA_JSON BETA: ISA json describing describing the ENA objects
--isa_assay_stream ISA_ASSAY_STREAM
BETA: specify the assay stream(s) that holds the ENA information, this can be a list of assay streams
--auto_action BETA: detect automatically which action (add or modify) to apply when the action column is not given
--tool TOOL_NAME specify the name of the tool this submission is done with. Default: ena-upload-cli
--tool_version TOOL_VERSION
@@ -88,7 +94,7 @@ To avoid exposing your credentials through the terminal history, it is recommend

### ENA sample checklists

You can specify ENA sample checklist using the `--checklist` parameter. By default the ENA default sample checklist is used supporting the minimum information required for the sample (ERC000011). The supported checklists are listed on the [ENA website](https://www.ebi.ac.uk/ena/browser/checklists). This website will also describe which Field Names you have to use in the header of your sample tsv table. The Field Names will be automatically mapped in the outputted xml if the correct `--checklist` parameter is given.
You can specify ENA sample checklist using the `--checklist` parameter. By default the ENA default sample checklist is used supporting the minimum information required for the sample (ERC000011). The supported checklists are listed on our [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates).

#### Fixed sample columns

@@ -104,55 +110,11 @@ The command line tool will automatically fetch the correct scientific name based

#### Viral submissions

If you want to submit viral samples you can use the [ENA virus pathogen](https://www.ebi.ac.uk/ena/browser/view/ERC000033) checklist by adding `ERC000033` to the checklist parameter. Check out our [viral example command](#test-the-tool) as demonstration. Please use the [ENA virus pathogen](https://www.ebi.ac.uk/ena/browser/view/ERC000033) checklist on the website of ENA to know which values are allowed/possible in the `restricted text` and `text choice` fields.
If you want to submit viral samples you can use the [ENA virus pathogen](https://www.ebi.ac.uk/ena/browser/view/ERC000033) checklist by adding `ERC000033` to the checklist parameter. Check out our [viral example command](#test-the-tool) as demonstration. Please use the [ENA virus pathogen](https://github.com/ELIXIR-Belgium/ENA-metadata-templates/tree/main/templates/ERC000033) checklist in our template repo to know what is allowed/possible in the `Controlled vocabulary`fields.

### ENA study, experiment and run tables

Here we list all the possible columns one can have in its study, experiment or run table along with its cardinality and controlled vocabulary (CV).
Currently we refer to the [ENA Webin](https://wwwdev.ebi.ac.uk/ena/submit/webin/) to discover which values are allowed when a controlled vocabulary is used, but this will change in the future.

#### Study tsv table

| Name of column | Cardinality | Documentation | CV |
|---|---|---|---|
| alias | mandatory | Submitter designated name for the object. The name must be unique within the submission account. | |
| title | mandatory | Title of the study as would be used in a publication. | |
| study_type | mandatory | The STUDY_TYPE presents a controlled vocabulary for expressing the overall purpose of the study. | yes |
| study_abstract | mandatory | Briefly describes the goals, purpose, and scope of the Study. This need not be listed if it can be inherited from a referenced publication. | |
| center_project_name | optional | Submitter defined project name. This field is intended for backward tracking of the study record to the submitter's LIMS. | |
| study_description | optional | More extensive free-form description of the study. | |
| pubmed_id | optional | Link to publication related to this study. | |

#### Experiment tsv table

| Name of column | Cardinality | Documentation | CV |
|---|---|---|---|
| alias | mandatory | Submitter designated name for the object. The name must be unique within the submission account. | |
| title | mandatory | Short text that can be used to call out experiment records in searches or in displays. | |
| study_alias | mandatory | Identifies the parent study. | |
| sample_alias | mandatory | Pick a sample to associate this experiment with. The sample may be an individual or a pool, depending on how it is specified. | |
| design_description | mandatory | Goal and setup of the individual library including library was constructed. | |
| spot_descriptor | optional | The SPOT_DESCRIPTOR specifies how to decode the individual reads of interest from the monolithic spot sequence. The spot descriptor contains aspects of the experimental design, platform, and processing information. There will be two methods of specification: one will be an index into a table of typical decodings, the other being an exact specification. This construct is needed for loading data and for interpreting the loaded runs. It can be omitted if the loader can infer read layout (from multiple input files or from one input files). | |
| library_name | optional | The submitter's name for this library. | |
| library_layout | mandatory | LIBRARY_LAYOUT specifies whether to expect single, paired, or other configuration of reads. In the case of paired reads, information about the relative distance and orientation is specified. | yes |
| insert_size | mandatory | Relative distance. | |
| library_strategy | mandatory | Sequencing technique intended for this library | yes |
| library_source | mandatory | The LIBRARY_SOURCE specifies the type of source material that is being sequenced. | yes |
| library_selection | mandatory | Method used to enrich the target in the sequence library preparation | yes |
| platform | mandatory | The PLATFORM record selects which sequencing platform and platform-specific runtime parameters. This will be determined by the Center. | yes |
| instrument_model | mandatory | Model of the sequencing instrument. | yes |
| library_construction_protocol | optional | Free form text describing the protocol by which the sequencing library was constructed. | |


#### Run tsv table

| Name of column | Cardinality | Documentation | CV |
|---|---|---|---|
| alias | mandatory | Submitter designated name for the object. The name must be unique within the submission account. | |
| experiment_alias | mandatory | Identifies the parent experiment. | |
| file_name | mandatory | The name or relative pathname of a run data file. | |
| file_type | mandatory | The run data file model. | yes |
| file_checksum | optional | Checksum of uncompressed file. If not given, the checksum will be calculated based on the data files specified in the --data option | |
Please check out the [template](https://github.com/ELIXIR-Belgium/ENA-metadata-templates) of your checklist to discover which attributes are mandatory for the study, experiment and run ENA object.


### Dev instance
@@ -176,7 +138,7 @@ There are two ways of submitting only a selection of objects to ENA. This is han
| sample_alias_5 | | sample_title_2 | 2697049 | sample_description_2 |


> IMPORTANT: if the status column is given but not filled in, or filled in with a different action from the one in the `--action` parameter, not rows will be submitted! Either leave out the column or add to every row the corect action.
> IMPORTANT: if the status column is given but not filled in, or filled in with a different action from the one in the `--action` parameter, no rows will be submitted! Either leave out the column or add to every row you want to submit the correct action.

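The rule in the note above can be pictured with a small sketch. This is not the tool's own code: `rows_to_submit` is a hypothetical helper, the rows are made-up dictionaries, and the sketch assumes the column is named `status` and that matching is case-insensitive.

```python
def rows_to_submit(rows, action):
    """Return the rows that would be submitted for `action`.

    Hypothetical helper for illustration, not part of ena-upload-cli.
    """
    if not rows:
        return []
    # No status column at all: the whole table is submitted.
    if all("status" not in row for row in rows):
        return list(rows)
    # Status column present: keep only rows whose status matches the action.
    wanted = action.strip().lower()
    return [
        row for row in rows
        if str(row.get("status") or "").strip().lower() == wanted
    ]


samples = [
    {"alias": "sample_alias_4", "status": "add"},
    {"alias": "sample_alias_5", "status": ""},
]
print([r["alias"] for r in rows_to_submit(samples, "add")])  # ['sample_alias_4']
```

With `--action modify` and the same rows, the sketch returns an empty list, which is the "no rows will be submitted" case the note warns about.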

### Using Excel templates
@@ -215,7 +177,7 @@ By default the updated tables after submission will have the action `added` in t
## Tool overview

**inputs**:
* metadata tables/excelsheet
* metadata tables/excelsheet/isa_json
* examples in `example_table` and on this [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates) for excel sheets
* (optional) define actions in **status** column e.g. `add`, `modify`, `cancel`, `release` (when not given the whole table is submitted)
* to perform bulk submission of all objects, the `aliases ids` in different ENA objects should be in the association where alias ids in experiment object link all objects together
@@ -262,6 +224,11 @@ By default the updated tables after submission will have the action `added` in t
ena-upload-cli --action add --center 'your_center_name' --data example_data/*gz --dev --checklist ERC000033 --secret .secret.yml --xlsx example_tables/ENA_excel_example_ERC000033.xlsx
```

* **Using an ISA JSON**
```
ena-upload-cli --action add --center 'your_center_name' --data example_data/*gz --dev --checklist ERC000033 --secret .secret.yml --isa_json tests/test_data/simple_test_case_v2.json --isa_assay_stream "Ena stream 1"
```

* **Release submission**
```
ena-upload-cli --action release --center'your_center_name' --study example_tables/ENA_template_studies_release.tsv --dev --secret .secret.yml
2 changes: 1 addition & 1 deletion ena_upload/_version.py
@@ -1 +1 @@
__version__ = "0.6.4"
__version__ = "0.7.0"
51 changes: 44 additions & 7 deletions ena_upload/ena_upload.py
@@ -12,6 +12,7 @@
import hashlib
import ftplib
import requests
import json
import uuid
import numpy as np
import re
@@ -21,6 +22,8 @@
import tempfile
from ena_upload._version import __version__
from ena_upload.check_remote import remote_check
from ena_upload.json_parsing.ena_submission import EnaSubmission


SCHEMA_TYPES = ['study', 'experiment', 'run', 'sample']

@@ -371,7 +374,7 @@ def get_taxon_id(scientific_name):
taxon_id = r.json()[0]['taxId']
return taxon_id
except ValueError:
msg = f'Oops, no taxon ID avaible for {scientific_name}. Is it a valid scientific name?'
msg = f'Oops, no taxon ID available for {scientific_name}. Is it a valid scientific name?'
sys.exit(msg)


@@ -390,7 +393,7 @@ def get_scientific_name(taxon_id):
taxon_id = r.json()['scientificName']
return taxon_id
except ValueError:
msg = f'Oops, no scientific name avaible for {taxon_id}. Is it a valid taxon_id?'
msg = f'Oops, no scientific name available for {taxon_id}. Is it a valid taxon_id?'
sys.exit(msg)


@@ -413,16 +416,15 @@ def submit_data(file_paths, password, webin_id):

except IOError as ioe:
print(ioe)
print("ERROR: could not connect to the ftp server.\
sys.exit("ERROR: could not connect to the ftp server.\
Please check your login details.")
sys.exit()
for filename, path in file_paths.items():
print(f'uploading {path}')
try:
print(ftps.storbinary(f'STOR {filename}', open(path, 'rb')))
except BaseException as err:
print(f"ERROR: {err}")
print("ERROR: If your connection times out at this stage, it propably is because of a firewall that is in place. FTP is used in passive mode and connection will be opened to one of the ports: 40000 and 50000.")
print("ERROR: If your connection times out at this stage, it probably is because of a firewall that is in place. FTP is used in passive mode and connection will be opened to one of the ports: 40000 and 50000.")
raise
print(ftps.quit())

@@ -699,7 +701,7 @@ def process_args():

parser.add_argument('--data',
nargs='*',
help='data for submission',
help='data for submission, this can be a list of files',
metavar='FILE')

parser.add_argument('--center',
@@ -712,6 +714,13 @@ def process_args():

parser.add_argument('--xlsx',
help='filled in excel template with metadata')

parser.add_argument('--isa_json',
help='BETA: ISA json describing describing the ENA objects')

parser.add_argument('--isa_assay_stream',
nargs='*',
help='BETA: specify the assay stream(s) that holds the ENA information, this can be a list of assay streams')

parser.add_argument('--auto_action',
action="store_true",
@@ -749,7 +758,7 @@ def process_args():

# check if any table is given
tables = set([args.study, args.sample, args.experiment, args.run])
if tables == {None} and not args.xlsx:
if tables == {None} and not args.xlsx and not args.isa_json:
parser.error('Requires at least one table for submission')

# check if .secret file exists
@@ -764,6 +773,14 @@ def process_args():
msg = f"Oops, the file {args.xlsx} does not exist"
parser.error(msg)

# check if ISA json file exists
if args.isa_json:
if not os.path.isfile(args.isa_json):
msg = f"Oops, the file {args.isa_json} does not exist"
parser.error(msg)
if args.isa_assay_stream is None :
parser.error("--isa_json requires --isa_assay_stream")

# check if data is given when adding a 'run' table
if (not args.no_data_upload and args.run and args.action.upper() not in ['RELEASE', 'CANCEL']) or (not args.no_data_upload and args.xlsx and args.action.upper() not in ['RELEASE', 'CANCEL']):
if args.data is None:
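The `--isa_json` checks added in this hunk can be tried in isolation. The sketch below is a trimmed stand-in for `process_args()`, not the real function: `parse_isa_args` is a hypothetical name, only the two new options are declared, and the error messages are copied from the diff.

```python
import argparse
import os


def parse_isa_args(argv):
    """Parse only the two ISA options and apply the checks from the diff."""
    parser = argparse.ArgumentParser(prog="isa-args-sketch")
    parser.add_argument("--isa_json")
    # nargs='*' collects zero or more stream names into a list.
    parser.add_argument("--isa_assay_stream", nargs="*")
    args = parser.parse_args(argv)

    if args.isa_json:
        # The ISA JSON file must exist on disk.
        if not os.path.isfile(args.isa_json):
            parser.error(f"Oops, the file {args.isa_json} does not exist")
        # The flag itself must be present; parser.error exits with status 2.
        if args.isa_assay_stream is None:
            parser.error("--isa_json requires --isa_assay_stream")
    return args
```

One thing the sketch makes visible: with `nargs='*'`, passing `--isa_assay_stream` with no values yields an empty list rather than `None`, so it passes the `is None` check.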
@@ -816,6 +833,8 @@ def main():
secret = args.secret
draft = args.draft
xlsx = args.xlsx
isa_json_file = args.isa_json
isa_assay_stream = args.isa_assay_stream
auto_action = args.auto_action

with open(secret, 'r') as secret_file:
@@ -857,6 +876,24 @@ def main():
schema_dataframe[schema] = xl_sheet
path = os.path.dirname(os.path.abspath(xlsx))
schema_tables[schema] = f"{path}/ENA_template_{schema}.tsv"
elif isa_json_file:
# Read json file
isa_json = json.load(open(isa_json_file))

Review comment: the file handle is kept open here; maybe use a `with open(...)` statement?


schema_tables = {}
schema_dataframe = {}
required_assays = []
for stream in isa_assay_stream:
required_assays.append({"assay_stream": stream})
submission = EnaSubmission.from_isa_json(isa_json, required_assays)
submission_dataframes = submission.generate_dataframes()
for schema, df in submission_dataframes.items():
schema_dataframe[schema] = check_columns(
df, schema, action, dev, auto_action)
path = os.path.dirname(os.path.abspath(isa_json_file))
schema_tables[schema] = f"{path}/ENA_template_{schema}.tsv"


else:
# collect the schema with table input from command-line
schema_tables = collect_tables(args)
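The review comment on the `json.load(open(isa_json_file))` line can be addressed along these lines. This is a sketch only: `load_isa_json` and `build_required_assays` are hypothetical helper names, the UTF-8 encoding is an assumption, and the `{"assay_stream": ...}` shape is taken from the loop in the diff.

```python
import json


def load_isa_json(path):
    """Read an ISA JSON file; the handle is closed when the block exits."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def build_required_assays(streams):
    """Wrap each assay stream name in the dictionary shape used in the diff
    before the list is handed to EnaSubmission.from_isa_json."""
    return [{"assay_stream": stream} for stream in streams]
```

The list comprehension is equivalent to the `for`/`append` loop in the hunk; whether it reads better there is a matter of taste.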
Empty file.