KnowEnG's Data Cleanup Pipeline

This is the Data Cleanup Pipeline of the Knowledge Engine for Genomics (KnowEnG), an NIH BD2K Center of Excellence. The pipeline cleans up the data in a given spreadsheet for subsequent processing by the KnowEnG Analytics Platform.

Detailed cleanup steps for each pipeline

geneset_characterization_pipeline

After removing empty rows and columns from the user spreadsheet data, check:

if the spreadsheet is empty, reject.
if the spreadsheet contains NaN values, drop the corresponding columns.
if the spreadsheet contains only non-negative real values, accept. If not, reject.
if the spreadsheet contains NaN values, reject.
if the spreadsheet contains duplicate column names, remove the duplicated columns.
if the spreadsheet contains duplicate row names, remove the duplicated rows.
if the spreadsheet gene names can be mapped to Ensembl gene names, generate the mapping files.
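
As an illustration, the checks listed above can be sketched with pandas. This is a minimal sketch under stated assumptions, not the pipeline's actual code: the function name `clean_geneset_spreadsheet` is hypothetical, genes are assumed to be rows and samples columns, the first occurrence of a duplicated name is assumed to be the one kept, and the Ensembl mapping step is left out because it needs the gene name lookup service.

```python
import numpy as np
import pandas as pd


def clean_geneset_spreadsheet(df):
    """Return a cleaned copy of df, or None if the spreadsheet is rejected."""
    # Remove rows and columns that are entirely empty.
    df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")
    if df.empty:
        return None  # reject: nothing left to process
    # Drop every column that still contains a missing value.
    df = df.dropna(axis=1, how="any")
    if df.empty:
        return None
    # Reject unless every remaining value is a non-negative real number.
    try:
        values = df.to_numpy(dtype=float)
    except (TypeError, ValueError):
        return None  # at least one cell is not numeric
    if np.isnan(values).any() or (values < 0).any():
        return None
    clean = pd.DataFrame(values, index=df.index, columns=df.columns)
    # Keep the first occurrence of each duplicated column and row name
    # (which copy to keep is an assumption of this sketch).
    clean = clean.loc[:, ~clean.columns.duplicated()]
    clean = clean.loc[~clean.index.duplicated()]
    return clean
```

A return value of `None` stands for "reject"; any other return value is the spreadsheet that would move on to gene name mapping.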

samples_clustering_pipeline

After removing empty rows and columns from the user spreadsheet data, check:

if the spreadsheet contains NaN values, drop the corresponding columns.
if the spreadsheet contains only real, positive values, accept. If not, reject.
if the spreadsheet contains NaN values in gene names, remove the corresponding rows.
if the spreadsheet contains duplicate column names, remove the duplicate columns.
if the spreadsheet contains duplicate row names, remove the duplicate rows.
map the spreadsheet gene names to Ensembl names and generate the mapping files.

If the user provides network data, check:

if the network data is empty, reject.
if the network data cannot be intersected with the genomic spreadsheet, reject.

If the user provides phenotype data, after removing empty rows and columns, check:

if the phenotypic data cannot be intersected with the genomic spreadsheet, reject.
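
The intersection checks for network and phenotype data can be pictured as simple set operations. The sketch below is only illustrative: the helper names are hypothetical, and it assumes the network is a list of gene pairs (edges) and that "intersected" means the two inputs share at least one name.

```python
def unique_network_genes(edges):
    """Collect the distinct gene names that appear in a list of (gene, gene) edges."""
    genes = set()
    for node_a, node_b in edges:
        genes.add(node_a)
        genes.add(node_b)
    return genes


def can_intersect(spreadsheet_names, other_names):
    """Return True when the two collections share at least one name."""
    return bool(set(spreadsheet_names) & set(other_names))


def accept_network(edges, spreadsheet_genes):
    """Reject an empty network or one that shares no gene with the spreadsheet."""
    if not edges:
        return False
    return can_intersect(spreadsheet_genes, unique_network_genes(edges))
```

For example, `accept_network([("g1", "g2")], ["g2", "g3"])` is accepted because `g2` appears in both inputs. The same `can_intersect` helper can compare phenotype sample names with the spreadsheet's sample names.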

gene_prioritization_pipeline

After removing empty rows and columns from the user spreadsheet data, check:

based on the impute option the user selected:
a. reject: reject the user spreadsheet if it contains any missing values.
b. average: replace missing values with the mean value of the containing row.
c. remove: drop any columns with missing values.
reject if the genomic or phenotypic data is empty.
if the spreadsheet contains non-real values, reject.
remove any rows whose gene names are missing.
if the spreadsheet contains duplicate column names, remove the duplicate columns.
if the spreadsheet contains duplicate row names, remove the duplicate rows.
map the spreadsheet gene names to Ensembl names and generate the mapping files.
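
The three impute options (reject, average, remove) described above could be expressed with pandas roughly as follows. This is a hedged sketch: `apply_impute_option` is a hypothetical name, and returning `None` is just this example's way of signalling rejection.

```python
import numpy as np
import pandas as pd


def apply_impute_option(df, option):
    """Handle missing values according to the selected impute option.

    Returns the processed DataFrame, or None when the spreadsheet is rejected.
    """
    if option == "reject":
        # Any missing value at all rejects the spreadsheet.
        return None if df.isna().any().any() else df
    if option == "average":
        # Replace each missing value with the mean of the row that contains it.
        return df.apply(lambda row: row.fillna(row.mean()), axis=1)
    if option == "remove":
        # Drop every column that has at least one missing value.
        return df.dropna(axis=1, how="any")
    raise ValueError("unknown impute option: " + repr(option))
```

With a row `[1.0, NaN, 3.0]`, the `average` option fills the gap with `2.0`, while `remove` drops the middle column for every row.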

After removing empty rows and columns from the phenotype data, check:

If the correlation measure is t-test or edgeR:
a. Force any string phenotypes to lowercase.
b. Convert each phenotype to binary encoding. For each phenotype, let num_distinct_values be the number of distinct values, excluding NA, in the phenotype.
- if num_distinct_values < 2, drop the phenotype.
- if num_distinct_values == 2 and the two distinct values are 0 and 1, leave the phenotype unchanged.
- if num_distinct_values == 2 and the two distinct values are not 0 and 1, replace all instances of one of the distinct values with 0 and replace all instances of the other distinct value with 1. Preserve any missing values. Edit the phenotype name to indicate which of the original values is now represented by 1.
- if num_distinct_values > 2, expand the phenotype into num_distinct_values indicator phenotypes; any NAs will be preserved.
c. For each of the binary phenotypes present at the end of step b, count the number of samples having value 0 and the number of samples having value 1. If either of those counts is less than 2, drop the phenotype.
d. Confirm at least one phenotype remains.
for the Pearson test, check that the phenotypic data contains only numeric values. If not, reject.
for every single drug:
1. drop NA values for the current drug.
2. intersect the header with the spreadsheet. If an intersection exists, add this drug to a common list; repeat until all drugs have been processed.
3. check whether the common list from step 2 is empty. If it is empty, reject.
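
The binary encoding rules for the t-test and edgeR measures are easier to follow with a small example. The sketch below handles one phenotype at a time. It is an assumption-laden illustration: `binarize_phenotype` is a hypothetical name, the `name_value` naming scheme for renamed phenotypes is invented for this example, and the choice of which original value becomes 1 (the later one in sorted order) is arbitrary.

```python
import pandas as pd


def binarize_phenotype(name, values):
    """Encode one phenotype (a pandas Series) as zero or more 0/1 phenotypes.

    Returns a dict mapping phenotype names to encoded Series; missing
    entries stay missing. An empty dict means the phenotype was dropped.
    """
    # Step a: force string values to lowercase, leaving other values alone.
    values = values.map(lambda v: v.lower() if isinstance(v, str) else v)
    distinct = sorted(values.dropna().unique(), key=str)
    encoded = {}
    if len(distinct) < 2:
        return encoded  # fewer than two distinct values: drop
    if len(distinct) == 2 and set(distinct) == {0, 1}:
        encoded[name] = values  # already binary: leave unchanged
    elif len(distinct) == 2:
        zero_value, one_value = distinct
        # Hypothetical naming scheme: record which value is now 1.
        new_name = name + "_" + str(one_value)
        encoded[new_name] = values.map({zero_value: 0, one_value: 1})
    else:
        # More than two values: one indicator phenotype per distinct value.
        for level in distinct:
            indicator = (values == level).astype(float).where(values.notna())
            encoded[name + "_" + str(level)] = indicator
    # Step c: keep only encodings with at least two 0s and at least two 1s.
    return {
        key: column
        for key, column in encoded.items()
        if (column == 0).sum() >= 2 and (column == 1).sum() >= 2
    }
```

A caller would run this over every phenotype, merge the returned dictionaries, and reject the input if the merged result is empty, which corresponds to step d.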

pasted_gene_list

After removing empty rows and columns from the user spreadsheet data, check:

if the input gene names contain NaN values, remove the corresponding rows.
cast the index of the input genes dataframe to string type.
retrieve the gene mapping status from the database and add it to the existing dataframe as a status column.
if the dataframe from the previous step intersects with the universal genes list from the Redis database, mark the intersected genes with value 1, else 0.
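
The pasted gene list steps amount to annotating each gene with a mapping status and a membership flag. The sketch below replaces the Redis lookup with an in-memory dictionary so that it is self-contained; the function name, the tuple layout, and the use of `unmapped-none` as the status for unknown genes are assumptions made for illustration.

```python
def annotate_pasted_genes(input_genes, name_lookup, universe):
    """Return (input_name, status, in_universe_flag) tuples for each gene.

    name_lookup stands in for the database lookup (input name -> mapped name);
    universe stands in for the universal genes list.
    """
    rows = []
    for gene in input_genes:
        # Skip missing names: None, NaN (which is not equal to itself), or blank.
        if gene is None or gene != gene or str(gene).strip() == "":
            continue
        gene = str(gene)  # cast every name to string
        status = name_lookup.get(gene, "unmapped-none")
        flag = 1 if status in universe else 0
        rows.append((gene, status, flag))
    return rows
```

For instance, with `name_lookup = {"abc_def_er": "ENS00000012345"}` and `universe = {"ENS00000012345"}`, the gene `abc_def_er` is flagged 1 and any unknown gene is flagged 0.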

general_clustering_pipeline

After removing empty rows and columns from the user spreadsheet data, check:

if the spreadsheet contains NaN values, drop the corresponding columns.
if the spreadsheet contains only real, positive values, accept. If not, reject.
if the spreadsheet contains NaN values in gene names, remove the corresponding rows.
if the spreadsheet contains NaN values in the header, remove the corresponding columns.
if the spreadsheet contains duplicate row names, remove the duplicate rows.
if the spreadsheet contains duplicate column names, remove the duplicate columns.

If the user provides phenotype data, after removing empty rows and columns, check:

if the phenotypic spreadsheet contains duplicate column names, remove the duplicate columns.
if the phenotypic spreadsheet contains duplicate row names, remove the duplicate rows.
if the phenotypic spreadsheet intersects with the genomic spreadsheet, accept. If not, reject.

signature_analysis_pipeline

After removing empty rows and columns from the user spreadsheet data, check:

if the spreadsheet contains NaN values, drop the corresponding columns.
if the spreadsheet contains only positive, real values, accept. If not, reject.
if the spreadsheet contains duplicate row names, reject.
if the spreadsheet contains duplicate column names, reject.
if the spreadsheet contains at least two unique values per column, accept. If not, reject.
map the spreadsheet gene names to Ensembl names and generate the mapping files.
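
Unlike the other pipelines, this one rejects duplicate names instead of removing them, and it requires some variation in every column. A minimal pandas sketch of those two checks is shown below; `passes_signature_checks` is a hypothetical name, and the value checks and gene mapping are omitted.

```python
import pandas as pd


def passes_signature_checks(df):
    """Return True when df has unique row/column names and varied columns."""
    # Duplicate row or column names cause rejection rather than removal.
    if df.index.duplicated().any() or df.columns.duplicated().any():
        return False
    # Every column must hold at least two distinct (non-missing) values.
    return bool((df.nunique(axis=0) >= 2).all())
```

A column such as `[3, 3]` has only one unique value, so a spreadsheet containing it would be rejected.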

After removing empty rows and columns from the signature data, check:

whether the signature data can be intersected with the spreadsheet.

If the user provides network data, check:

whether the unique genes in the network data intersect with the signature data and the spreadsheet data.

feature_prioritization_pipeline

After removing empty rows and columns from the user spreadsheet data, check:

based on the impute option the user selected:
a. reject: reject the user spreadsheet if it contains any NA value.
b. average: replace each NA value with the mean of its row.
c. remove: drop any columns with missing values.
if the spreadsheet contains non-real values, reject.
if the correlation measure is edgeR and the spreadsheet contains negative values, reject; otherwise, accept.

After removing empty rows and columns from the phenotype data, check:

If the correlation measure is t-test or edgeR:
a. Force any string phenotypes to lowercase.
b. Convert each phenotype to binary encoding. For each phenotype, let num_distinct_values be the number of distinct values, excluding NA, in the phenotype.
- if num_distinct_values < 2, drop the phenotype.
- if num_distinct_values == 2 and the two distinct values are 0 and 1, leave the phenotype unchanged.
- if num_distinct_values == 2 and the two distinct values are not 0 and 1, replace all instances of one of the distinct values with 0 and replace all instances of the other distinct value with 1. Preserve any missing values. Edit the phenotype name to indicate which of the original values is now represented by 1.
- if num_distinct_values > 2, expand the phenotype into num_distinct_values indicator phenotypes; any NAs will be preserved.
c. For each of the binary phenotypes present at the end of step b, count the number of samples having value 0 and the number of samples having value 1. If either of those counts is less than 2, drop the phenotype.
d. Confirm at least one phenotype remains.
for the Pearson test, check that the phenotypic data contains only numeric values. If not, reject.
for every single phenotype:
1. drop NA values for the current phenotype.
2. intersect the header with the spreadsheet. If an intersection exists, add this phenotype to a common list; repeat until all phenotypes have been processed.
3. check whether the common list from step 2 is empty. If it is empty, reject.
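
The per-phenotype loop in the last item can be sketched in plain Python. This example assumes that "header" refers to sample names, that each phenotype is given as a mapping from sample name to value, and that `None` marks a missing value; the function name is hypothetical.

```python
def phenotypes_sharing_samples(phenotypes, spreadsheet_samples):
    """Return the phenotypes that have an observed value for a spreadsheet sample.

    phenotypes: dict of phenotype name -> dict of sample name -> value,
    where None marks a missing value.
    """
    spreadsheet_samples = set(spreadsheet_samples)
    common = []
    for name, by_sample in phenotypes.items():
        # Step 1: drop missing values for the current phenotype.
        observed = {sample for sample, value in by_sample.items() if value is not None}
        # Step 2: keep the phenotype if any observed sample is in the spreadsheet.
        if observed & spreadsheet_samples:
            common.append(name)
    # Step 3 is left to the caller: an empty list means "reject".
    return common
```

A phenotype whose only overlapping samples are missing values does not count as intersecting, because its missing entries are dropped before the comparison.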

phenotype_prediction_pipeline

After removing empty rows and columns from the user spreadsheet data, check:

if the spreadsheet contains NaN values, drop the corresponding columns.
if the spreadsheet contains only real values, accept. If not, reject.
if the spreadsheet contains duplicate row names, remove the duplicate rows.
if the spreadsheet contains duplicate column names, remove the duplicate columns.
map the spreadsheet gene names to Ensembl names and generate the mapping files.

After removing empty rows and columns from the phenotype data, check:

whether the phenotypic data intersects with the spreadsheet on phenotype.
for the Pearson test, whether the phenotypic data contains only real values or NaN.
for every single drug:
1. drop NA values for the current drug.
2. intersect the header with the spreadsheet. If an intersection exists, add this drug to a common list; repeat until all drugs have been processed.
3. check whether the common list from step 2 is empty. If it is empty, reject.

simplified_inpherno_pipeline

After removing empty rows and columns from the user spreadsheet data, check:

if the expression_sample data contains only real values, accept. If not, reject.
if the expression_sample data's gene names can be mapped to Ensembl gene names, generate the mapping files.

After removing empty rows and columns from the Pvalue_gene_phenotype data, check:

if the Pvalue_gene_phenotype data contains only real values, accept. If not, reject.
if the Pvalue_gene_phenotype data's gene names can be mapped to Ensembl gene names, generate the mapping files.

After removing empty rows and columns from the TFexpression data, check:

if the TFexpression data contains only real values and does not contain NA, accept. If not, reject.
if the TFexpression data's gene names can be mapped to Ensembl gene names, generate the mapping files.

How to run this pipeline with our data

1. Clone the Data_Cleanup_Pipeline Repo

 git clone https://github.com/KnowEnG/Data_Cleanup_Pipeline.git

2. Install the following (Ubuntu or Linux)

 apt-get install -y python3-pip
 apt-get install -y libblas-dev liblapack-dev libatlas-base-dev gfortran
 pip3 install numpy
 pip3 install pandas
 pip3 install scipy==0.19.1
 pip3 install scikit-learn==0.19.2
 apt-get install -y libfreetype6-dev libxft-dev
 pip3 install xmlrunner
 pip3 install pyyaml
 pip3 install knpackage
 pip3 install redis

3. Change directory to Data_Cleanup_Pipeline

cd Data_Cleanup_Pipeline

4. Change directory to test

cd test

5. Create a local directory "run_dir" and place all the run files in it

make env_setup

6. Use one of the following "make" commands to select and run a data cleanup pipeline

Command	Option
make run_data_cleaning	example test with large dataset
make run_samples_clustering_pipeline	samples clustering test
make run_gene_prioritization_pipeline_pearson	pearson correlation test
make run_gene_prioritization_pipeline_t_test	t-test correlation test
make run_geneset_characterization_pipeline	geneset characterization test
make run_general_clustering_pipeline	general clustering test
make run_pasted_gene_list	pasted gene list test
make run_phenotype_prediction_pipeline	phenotype prediction pipeline test
make run_feature_prioritization_pipeline_pearson	feature prioritization pipeline test
make run_feature_prioritization_pipeline_t_test_binary	feature prioritization pipeline test
make run_feature_prioritization_pipeline_t_test_replace	feature prioritization pipeline test
make run_feature_prioritization_pipeline_t_test_expand	feature prioritization pipeline test
make run_feature_prioritization_pipeline_t_test_mixed	feature prioritization pipeline test
make run_feature_prioritization_pipeline_edgeR_binary	feature prioritization pipeline test
make run_feature_prioritization_pipeline_edgeR_replace	feature prioritization pipeline test
make run_feature_prioritization_pipeline_edgeR_expand	feature prioritization pipeline test
make run_feature_prioritization_pipeline_edgeR_mixed	feature prioritization pipeline test
make run_signature_analysis_pipeline	signature analysis pipeline test
make run_simplified_inpherno_pipeline	simplified_inpherno_pipeline test

How to run this pipeline with your data

Follow steps 1-4 above then do the following:

* Create your run directory

mkdir run_directory

* Change directory to the run_directory

cd run_directory

* Create your results directory

mkdir results_directory

* Create run_parameters file (YAML Format)

Look for examples of run_parameters in ./Data_Cleanup_Pipeline/data/run_files/TEMPLATE_data_cleanup.yml

* Modify run_parameters file (YAML Format)

Set the spreadsheet and drug_response (phenotype data) file names to point to your data.

* Run the Data Cleanup Pipeline:

Update the PYTHONPATH environment variable

export PYTHONPATH='../src':$PYTHONPATH

Run (these relative paths assume you are in the test directory with setup as described above)

python3 ../src/data_cleanup.py -run_directory ./run_dir -run_file TEMPLATE_data_cleanup.yml

Description of "run_parameters" file

Key	Value	Comments
pipeline_type	gene_prioritization_pipeline, ...	Pipeline cleaning type to run
spreadsheet_name_full_path	directory+spreadsheet_name	Path and file name of user genomic spreadsheet
phenotype_full_path	directory+phenotype_data_name	Path and file name of user phenotypic spreadsheet
gg_network_name_full_path	directory+gg_network_name	Path and file name of user network
results_directory	directory	Directory to save the output files
redis_credential	host, password and port	Credentials to access the gene names lookup
taxonid	9606	Taxon id of the genes
source_hint	' '	Hint for looking up Ensembl names
correlation_measure	t_test/pearson/edgeR* (*FP only)	Correlation measure for the gene/feature prioritization pipelines

spreadsheet_name_full_path = TEST_1_gene_expression.tsv
phenotype_full_path = TEST_1_phenotype.tsv
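
To make the table concrete, the following sketch builds a set of run parameters in Python and renders them as flat `key: value` lines. It is only an illustration: the values are placeholders taken from the table and the example file names above, `render_run_file` is a hypothetical helper, and `redis_credential` and `gg_network_name_full_path` are left out because their exact layout is not described here. The TEMPLATE_data_cleanup.yml file in the repository remains the authoritative example.

```python
# Placeholder values; adjust every entry to match your own data.
run_parameters = {
    "pipeline_type": "gene_prioritization_pipeline",
    "spreadsheet_name_full_path": "TEST_1_gene_expression.tsv",
    "phenotype_full_path": "TEST_1_phenotype.tsv",
    "results_directory": "./results_directory",
    "taxonid": 9606,
    "source_hint": "",
    "correlation_measure": "t_test",
}


def render_run_file(parameters):
    """Render a flat mapping as simple `key: value` lines."""
    lines = []
    for key, value in parameters.items():
        # Write empty strings as quoted empty scalars so the key keeps a value.
        text = "''" if value == "" else str(value)
        lines.append(key + ": " + text)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_run_file(run_parameters), end="")
```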

Description of Output files saved in results directory

Output files

input_file_name_ETL.tsv. Input file after Extract Transform Load (cleaning)

input_file_name_MAP.tsv.

(translated gene)	(input gene name)
ENS00000012345	abc_def_er
...	...
ENS00000054321	def_org_ifi
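
A mapping file with this two-column layout can be read back with Python's standard `csv` module. The sketch below assumes the file is tab-separated with no header row, the translated name in the first column, and the input name in the second; `read_gene_map` is a hypothetical helper name.

```python
import csv
import io


def read_gene_map(handle):
    """Map each input gene name to its translated name from a *_MAP.tsv file."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            translated, original = row[0], row[1]
            mapping[original] = translated
    return mapping


# Example with in-memory text; for a real file, pass open(path, newline="").
sample = io.StringIO("ENS00000012345\tabc_def_er\nENS00000054321\tdef_org_ifi\n")
gene_map = read_gene_map(sample)
```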

input_file_name_UNMAPPED.tsv.

(input gene name)	(unmapped-none)
abcd_iffe	unmapped-none
...	...
abdcefg_hijk	unmapped-none
