ensembl-tui provides the eti terminal application for obtaining a subset of the data provided by Ensembl which can then be queried locally. You can have multiple such subsets on your machine, each corresponding to a different selection of species and data types.
Warning: ensembl-tui is in a preliminary phase of development, with a limited feature set and incomplete test coverage! We currently only support accessing data from the main ensembl.org site. Please validate results against the web version. If you discover errors, please post a bug report.
General user installation instructions
$ pip install ensembl-tui
Developer installation instructions
Fork the repo and clone your fork to your local machine. In the terminal, create either a Python virtual environment or a new conda environment and activate it. In that virtual environment, install flit:
$ pip install flit
Then do the flit version of a "developer install". (This essentially creates a symlink to the repo's source directory.)
$ flit install -s --python `which python`
Ensembl hosts some very large data sets. You need to have a machine with sufficient disk space to store the data you want to download. At present we do not have support for predicting how much storage would be required for a given selection of species and data types. You will need to experiment.
Some commands can be run in parallel but have moderate memory requirements. If you have a machine with limited RAM, you may need to reduce the number of parallel processes. Again, run some experiments.
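Before experimenting, it can help to see how much free disk space and how many CPUs the machine reports. The following is a minimal sketch using only the Python standard library; the helper names free_gib and capped_procs are hypothetical and not part of ensembl-tui. It does not estimate how much storage or memory a given selection needs, so the advice to experiment still applies.

```python
import os
import shutil


def free_gib(path: str) -> float:
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024**3


def capped_procs(requested: int) -> int:
    """Cap a requested process count at the number of CPUs the OS reports."""
    available = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, min(requested, available))


if __name__ == "__main__":
    # "." is a placeholder; point this at the intended download location
    print(f"free space: {free_gib('.'):.1f} GiB")
    print(f"processes to try: {capped_procs(2)}")
```

The capped value is only an upper bound based on CPU count. On a machine with limited RAM a smaller number may still be needed.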
Specifying what data you want to download and where to put it
We use a plain text file to indicate the Ensembl domain, release and types of genomic data to download. Start by using the exportrc subcommand.
Usage: eti exportrc [OPTIONS]
exports sample config and species table to the nominated path
Options:
-o, --outpath PATH Path to directory to export all rc contents.
--help Show this message and exit.
$ eti exportrc -o ~/Desktop/Outbox/ensembl_download
This command creates an ensembl_download directory and writes two plain text files into it:
species.tsv: contains the Latin names, common names, etc. of the species accessible at the ensembl.org website.
sample.cfg: a sample configuration file that you can edit to specify the data you want to download.
The latter file includes comments on how to edit it in order to specify the genomic resources that you want.
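To find the species names to put in the config file, it may be convenient to preview species.tsv from a script. This is a minimal sketch that assumes the file is tab-separated with a header row, as the .tsv extension suggests. It makes no assumption about the column names, and preview_tsv is a hypothetical helper, not part of ensembl-tui.

```python
import csv
from itertools import islice
from pathlib import Path


def preview_tsv(path, limit=5):
    """Return the header row and up to `limit` data rows of a tab-separated file."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])  # empty list if the file has no rows
        rows = list(islice(reader, limit))
    return header, rows


if __name__ == "__main__":
    # path taken from the exportrc example above
    tsv = Path("~/Desktop/Outbox/ensembl_download/species.tsv").expanduser()
    if tsv.exists():
        header, rows = preview_tsv(tsv)
        print(header)
        for row in rows:
            print(row)
```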
Downloading the data
Downloads the data indicated in the config file to a local directory.
Usage: eti download [OPTIONS]
download data from Ensembl's ftp site
Options:
-c, --configpath PATH Path to config file specifying databases, (only species
or compara at present).
-d, --debug Maximum verbosity, and reduces number of downloads,
etc...
-v, --verbose
--help Show this message and exit.
For a config file named config.cfg, the download command would be:
$ cd to/directory/with/config.cfg
$ eti download -c config.cfg
Note: Downloads can be interrupted and resumed. The software deletes partially downloaded files.
The download creates a new .cfg file inside the download directory. This file is used by the install command.
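Because an interrupted download can be resumed by running the same command again, an unattended run can be wrapped in a simple retry loop. The sketch below is one way to do that. It assumes eti exits with a non-zero status when a download fails, which this document does not state, and run_until_success is a hypothetical helper, not part of ensembl-tui.

```python
import subprocess


def run_until_success(cmd, attempts=3, runner=subprocess.run):
    """Run `cmd` up to `attempts` times; return the attempt number that succeeded.

    Raises RuntimeError if every attempt exits with a non-zero status.
    `runner` is injectable so the loop can be exercised without calling eti.
    """
    for attempt in range(1, attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
    raise RuntimeError(f"{cmd!r} failed after {attempts} attempts")


# Example call (needs eti on PATH and a config.cfg in the current directory;
# the -c flag is taken from the usage text above):
# run_until_success(["eti", "download", "-c", "config.cfg"])
```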
Installing the data
Usage: eti install [OPTIONS]
create the local representations of the data
Options:
-d, --download PATH Path to local download directory containing a cfg
file.
-np, --num_procs INTEGER Number of procs to use. [default: 1]
-f, --force_overwrite Overwrite existing data.
-v, --verbose
--help Show this message and exit.
The following command uses 2 CPUs and has been safe on systems with only 16GB of RAM for 10 primate genomes, including homology data and whole genome:
$ cd to/directory/with/downloaded_data
$ eti install -d downloaded_data -np 2
Checking what has been installed
Usage: eti installed [OPTIONS]
show what is installed
Options:
-i, --installed TEXT Path to root directory of an installation. [required]
--help Show this message and exit.
We provide a conventional command line interface for querying the data with subcommands.
The full list of subcommands
You can get help on individual subcommands by running eti <subcommand> in the terminal.
Usage: eti [OPTIONS] COMMAND [ARGS]...
Tools for obtaining and interrogating subsets of https://ensembl.org genomic
data.
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
alignments export multiple alignments in fasta format for named genes
download download data from Ensembl's ftp site
dump-genes export meta-data table for genes from one species to...
exportrc exports sample config and species table to the nominated...
homologs exports CDS sequence data in fasta format for homology...
install create the local representations of the data
installed show what is installed
species-summary genome summary data for a species
tui Open Textual TUI.
We also provide an experimental terminal user interface (TUI) that allows you to explore the data more interactively. This is invoked with the tui subcommand.