This repository contains the source code used to create RaMP, a Relational database of Metabolic Pathways (https://www.ncbi.nlm.nih.gov/pubmed/29470400). RaMP is a conglomerate of 4 different databases (KEGG, Reactome, HMDB, and WikiPathways) and includes pathways and annotations for genes and metabolites. The database also includes chemical properties and chemical class annotations from HMDB, ChEBI and LIPIDMAPS.
These scripts build the RaMP database from scratch. If you are looking for the RaMP MySQL dump, you will find it here: https://github.com/ncats/RaMP-DB/tree/master/inst/extdata. If you are looking to access RaMP through our Shiny app, look here: https://github.com/ncats/RaMP-DB/
Here is the overall workflow for getting the MySQL database up and running:
- git clone this repository
- Run main.py to completion. This creates intermediate data files in the misc/output folder for each data type. Note that reading resource files from the various data sources is memory intensive.
- Run EntityBuilder.py to merge the data and build entity relationships. This produces one file for each table in the RaMP DB schema in /misc/sql/loader.
- Create an empty database in MySQL (e.g. called ramp)
- Import the SQL files into the database using src/util/rampDBBulkLoader. See code section for details: DB Loading Code
- You can now query the database!
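The first two build steps above can be sketched as a small driver script. This is only an illustration, not part of the repository: the script locations (`main/main.py` and `src/util/EntityBuilder.py`) follow the folder descriptions in this README, while the function names `build_commands` and `run_build`, and the assumption that both scripts can be launched from the repository root, are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(repo_root):
    """Return the commands for the first two build steps, in order."""
    root = Path(repo_root)
    return [
        # Step 1: create intermediate files in misc/output.
        [sys.executable, str(root / "main" / "main.py")],
        # Step 2: merge entities and write the loader files.
        [sys.executable, str(root / "src" / "util" / "EntityBuilder.py")],
    ]


def run_build(repo_root):
    """Run each step in turn; stop at the first failure."""
    for command in build_commands(repo_root):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, cwd=repo_root, check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_build(".") to execute them.
    for command in build_commands("."):
        print(" ".join(command))
```

Stopping at the first failure matters because the second step reads the files written by the first.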
Keep scrolling down for the details...
RaMP was built using Python 3.8, which can be downloaded here: https://www.python.org/downloads/. Make certain that you have Python 3.8 or later.
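One quick way to confirm the interpreter meets this requirement is a short check like the one below. It is a minimal sketch; the helper name `python_is_supported` is made up for illustration and does not exist in the repository.

```python
import sys

MINIMUM_VERSION = (3, 8)


def python_is_supported(version_info=None):
    """Return True if the interpreter is Python 3.8 or later."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor) so that e.g. 3.8.0 passes.
    return tuple(version_info[:2]) >= MINIMUM_VERSION


if __name__ == "__main__":
    found = sys.version.split()[0]
    if not python_is_supported():
        sys.exit("Python 3.8 or later is required; found " + found)
    print("Python version OK:", found)
```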
Using an IDE makes writing code much simpler.
One option is to use PyDev for Eclipse:
- download Eclipse: https://www.eclipse.org/downloads/
- install the PyDev software into Eclipse: http://www.wikihow.com/Install-Pydev-on-Eclipse
The package is being documented using Sphinx: https://samnicholls.net/2016/06/15/how-to-sphinx-readthedocs/ and http://autoapi.readthedocs.io/
Setup requires adding Sphinx to your PYTHONPATH. Traditionally, this would mean adding a line to your .bashrc. However, if you use Eclipse to run code, it does not pick up the PYTHONPATH set in .bashrc, so you need to add the PYTHONPATH to the project through the Eclipse interface. Here are the steps:
- Import Sphinx into Eclipse.
- Right click the package in the left-hand pane and select properties.
- Select "Pydev - PYTHONPATH" from the options.
- Click the "add source" button and select the src directory of the project.
- Click the apply and ok buttons.
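Outside Eclipse, a similar effect can be had at runtime by putting the src directory on Python's module search path. The snippet below is a hedged sketch of that alternative rather than something the repository does; `add_to_path` is a hypothetical helper, and it assumes the script runs from the repository root so that the relative path "src" resolves correctly.

```python
import sys
from pathlib import Path


def add_to_path(directory):
    """Prepend a directory to sys.path if it is not already there."""
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


if __name__ == "__main__":
    # "src" is the source directory named in the steps above.
    added = add_to_path("src")
    print(added in sys.path)
```

Unlike the Eclipse project setting, this change lasts only for the running process.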
main.py
The main script to build the database is 'main.py'. This script calls the other classes and necessary code to build the database.
The first time 'main.py' is run, you must switch the "getDatabaseFiles" parameter to True. Setting this parameter to True downloads the content of the databases (HMDB, Reactome, WikiPathways) to your computer. By default, "getDatabaseFiles" is set to False: if you add new functionality to the scripts, you may need to gather the information from each database again, but there is no need to download the original files again. Note that the Reactome resource updates monthly and changes the names of its files to include the date. The getDatabaseFile method in src/parse/reactomeData.py contains the variable to set.
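The download switch described above follows a common pattern: fetch when the flag is True, otherwise reuse what an earlier run left on disk. The sketch below illustrates that pattern only. It is not the code in main.py; `fetch_resource`, its parameters, and the injected `downloader` callable are all hypothetical names chosen for this example.

```python
from pathlib import Path


def fetch_resource(url, destination, get_database_files, downloader):
    """Download url to destination only when get_database_files is True.

    downloader is any callable taking (url, destination); passing it in
    keeps this sketch independent of a specific HTTP library.
    """
    destination = Path(destination)
    if get_database_files:
        destination.parent.mkdir(parents=True, exist_ok=True)
        downloader(url, destination)
        return "downloaded"
    if destination.exists():
        # Reuse the file fetched by an earlier run.
        return "reused"
    raise FileNotFoundError(
        f"{destination} is missing; rerun with get_database_files=True"
    )
```

For real downloads, the standard library's `urllib.request.urlretrieve` accepts the same two arguments and could be passed as `downloader`. Raising an error when the file is missing makes a forgotten first-run download fail early instead of producing empty output.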
Running 'main.py' produces intermediate files, organized by data source, in misc/output.
The next step is building entities and the final files for database loading. The files in misc/output are fed into an entity harmonization process that enforces entity curation patches, aggregates duplicate gene and compound entities from different data sources, and builds entity relationships. This step is started by running /src/util/EntityBuilder.py. It refers to a text file of manual curation results that capture known problems in the entity mappings held in certain data sources. The compound curation list prevents incorrect mappings between compounds from being introduced into the RaMP database.
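To make the idea of harmonization concrete, here is a toy sketch of merging IDs that different sources map to each other while skipping mappings listed in a curation file. It is not the algorithm used by EntityBuilder.py: the function `merge_entities`, the union-find approach, and the ID strings in the usage example are all assumptions made for illustration.

```python
def merge_entities(id_pairs, blocked_pairs=()):
    """Group IDs that are linked by a mapping, skipping blocked mappings."""
    blocked = {frozenset(pair) for pair in blocked_pairs}
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for left, right in id_pairs:
        # Register both IDs so blocked ones still appear as entities.
        root_left, root_right = find(left), find(right)
        if frozenset((left, right)) in blocked:
            continue  # curated as an incorrect mapping
        if root_left != root_right:
            parent[root_right] = root_left

    groups = {}
    for item in parent:
        groups.setdefault(find(item), set()).add(item)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    # Made-up IDs: the third mapping is blocked by the curation list.
    pairs = [("hmdb:A", "chebi:1"), ("chebi:1", "kegg:X"), ("hmdb:B", "chebi:1")]
    print(merge_entities(pairs, blocked_pairs=[("hmdb:B", "chebi:1")]))
    # prints [['chebi:1', 'hmdb:A', 'kegg:X'], ['hmdb:B']]
```

One limitation of this simple version: a blocked pair can still end up in the same group if other, unblocked mappings connect the two IDs indirectly.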
Importing SQL tables
Import the SQL files into the database using src/util/rampDBBulkLoader. See code section for details: DB Loading Code
[Additional information coming soon on bringing up MySQL and initial creation of the database schema]
You may also need to download MySQL: https://www.mysql.com/downloads/.
The repo contains the following folders
**src**: contains classes used by main.py (most of the code)
**main**: contains main.py -- uses the classes in src (code)
**docs**: documentation/user manual
**tests**: unit testing
**misc/data**: folders where files from HMDB, WikiPathways, KEGG, and Reactome are stored
**misc/sql**: .sql files output from running main.py. These .sql files make up the RaMP MySQL database
**misc/output**: any output file that is not a .sql file -- some functions have other outputs, such as Venn diagrams or lists of converted genes (can help with debugging)
**misc/queryMySQL**: place to keep files that query the MySQL database once it exists
Note: when running main.py, the data, output, and sql folders contain lots of data which is not included with this repo due to size restrictions.
The first update to RaMP (01/24/2019):
- The major change adapts RaMP to the recent HMDB updates, which changed the hierarchy of the Ontology.
- Exclusion of the KEGG database.
- Fixed a few compound mismatches that occurred in the old version.
- John Braisted
- Ewy Mathé
Elizabeth Baskin, Senyang Hu, and Bofei Zhang were also involved in developing this code.