This is the Knowledge Engine for Genomics (KnowEnG), an NIH BD2K Center of Excellence, Data Cleanup Pipeline. This pipeline cleans up the data of a given spreadsheet for subsequent processing by the KnowEnG Analytics Platform.
After removing empty rows and columns from the user spreadsheet data, check:
- if spreadsheet is empty, reject.
- if spreadsheet contains NaN values, drop the corresponding columns.
- if spreadsheet contains only non-negative real values, accept. If not, reject.
- if spreadsheet contains NaN value, reject.
- if spreadsheet contains duplicate column names, remove the duplicated columns.
- if spreadsheet contains duplicate row names, remove the duplicated rows.
- if spreadsheet gene names can be mapped to Ensembl gene names, generate mapping files.
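
The value checks above can be sketched in plain Python. This is a minimal, dependency-free illustration and not the pipeline's own code: the spreadsheet is assumed to be a dict mapping each column (sample) name to its list of values, missing values are assumed to be `None` or NaN, and all function names are hypothetical.

```python
import math
from numbers import Real


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_columns_with_missing(columns):
    """Return a copy of `columns` without any column that has a missing value."""
    return {
        name: values
        for name, values in columns.items()
        if not any(is_missing(v) for v in values)
    }


def all_non_negative_real(columns):
    """True if every value is a finite real number >= 0 (bools are excluded)."""
    for values in columns.values():
        for v in values:
            if isinstance(v, bool) or not isinstance(v, Real):
                return False
            if not math.isfinite(v) or v < 0:
                return False
    return True


def check_spreadsheet(columns):
    """Return (cleaned_columns, message); cleaned_columns is None on rejection."""
    cleaned = drop_columns_with_missing(columns)
    if not cleaned or all(len(v) == 0 for v in cleaned.values()):
        return None, "rejected: spreadsheet is empty"
    if not all_non_negative_real(cleaned):
        return None, "rejected: spreadsheet contains a negative or non-real value"
    return cleaned, "accepted"
```

For example, `check_spreadsheet({"s1": [1.0, 2.0], "s2": [float("nan"), 3.0]})` keeps only column `s1` and accepts the result.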
After removing empty rows and columns from the user spreadsheet data, check:
- if spreadsheet contains NaN values, drop the corresponding columns.
- if spreadsheet contains only real, positive values, accept. If not, reject.
- if spreadsheet contains NaN values in gene names, remove the corresponding rows.
- if spreadsheet contains duplicate column names, remove duplicate columns.
- if spreadsheet contains duplicate row names, remove duplicate rows.
- map spreadsheet gene names to Ensembl names and generate mapping files.
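
Duplicate removal can be illustrated with a short sketch. It assumes rows (or columns) are given as `(name, values)` pairs and that the first occurrence of each name is the one kept; the source does not say which copy survives, so that choice and the function name are assumptions.

```python
def drop_duplicate_names(labelled_items):
    """Keep the first (name, values) pair for each name.

    Keeping the first occurrence is an assumption of this sketch.
    """
    seen = set()
    kept = []
    for name, values in labelled_items:
        if name in seen:
            continue  # a pair with this name was already kept
        seen.add(name)
        kept.append((name, values))
    return kept
```

The same helper applies to columns when they are passed as `(column_name, values)` pairs.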
If the user provides network data, check:
- if network data is empty, reject.
- if network data cannot be intersected with the genomic spreadsheet, reject.
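
A minimal sketch of the network checks, assuming the network is a list of edge tuples whose first two fields are node (gene) names. The edge layout and the function names are assumptions for illustration only.

```python
def network_genes(edges):
    """Collect the unique node names from (node_a, node_b, ...) edge tuples."""
    genes = set()
    for edge in edges:
        genes.update(edge[:2])
    return genes


def check_network(edges, spreadsheet_genes):
    """Return (ok, message) for the two network checks."""
    if not edges:
        return False, "rejected: network data is empty"
    common = network_genes(edges) & set(spreadsheet_genes)
    if not common:
        return False, "rejected: network shares no genes with the spreadsheet"
    return True, f"accepted: {len(common)} shared genes"
```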
If the user provides phenotype data, after removing empty rows and columns, check:
- if phenotypic data cannot be intersected with the genomic spreadsheet, reject.
After removing empty rows and columns from the user spreadsheet data, check:
- based on the impute option the user selected:
  a. reject: reject the user spreadsheet if it contains any missing values.
  b. average: replace missing values with the mean value of the containing row.
  c. remove: drop any columns with missing values.
- reject if genomic or phenotypic data is empty.
- if spreadsheet contains non-real values, reject.
- remove any rows whose gene names are missing.
- if spreadsheet contains duplicate column names, remove duplicate columns.
- if spreadsheet contains duplicate row names, remove duplicate rows.
- map spreadsheet gene names to Ensembl names and generate mapping files.
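
The three impute options can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: the spreadsheet is a dict mapping each gene name to a list of values (one per sample, all rows the same length), missing values are `None` or NaN, and the function names are hypothetical.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute(rows, option):
    """Apply the 'reject', 'average' or 'remove' option; None means rejected."""
    if option == "reject":
        has_missing = any(is_missing(v) for vals in rows.values() for v in vals)
        return None if has_missing else rows
    if option == "average":
        result = {}
        for gene, vals in rows.items():
            present = [v for v in vals if not is_missing(v)]
            if not present:
                # Nothing to average in an all-missing row; left unchanged here.
                result[gene] = list(vals)
                continue
            row_mean = sum(present) / len(present)
            result[gene] = [row_mean if is_missing(v) else v for v in vals]
        return result
    if option == "remove":
        n_cols = len(next(iter(rows.values()), []))
        keep = [
            j for j in range(n_cols)
            if not any(is_missing(vals[j]) for vals in rows.values())
        ]
        return {gene: [vals[j] for j in keep] for gene, vals in rows.items()}
    raise ValueError(f"unknown impute option: {option!r}")
```

With `rows = {"g1": [1.0, None, 3.0], "g2": [4.0, 5.0, 6.0]}`, the `average` option fills the gap in `g1` with the row mean `2.0`, while `remove` drops the second column from both rows.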
After removing empty rows and columns from the phenotype data, check:
- If the correlation measure is t-test or edgeR:
  a. Force any string phenotypes to lowercase.
  b. Convert each phenotype to binary encoding. For each phenotype, let `num_distinct_values` be the number of distinct values, excluding NA, in the phenotype.
     - if `num_distinct_values` < 2, drop the phenotype.
     - if `num_distinct_values` == 2 and the two distinct values are 0 and 1, leave the phenotype unchanged.
     - if `num_distinct_values` == 2 and the two distinct values are not 0 and 1, replace all instances of one of the distinct values with 0 and replace all instances of the other distinct value with 1. Preserve any missing values. Edit the phenotype name to indicate which of the original values is now represented by 1.
     - if `num_distinct_values` > 2, expand the phenotype into `num_distinct_values` indicator phenotypes; any NAs will be preserved.
  c. For each of the binary phenotypes present at the end of step b, count the number of samples having value 0 and the number of samples having value 1. If either of those counts is less than 2, drop the phenotype.
  d. Confirm at least one phenotype remains.
- for the Pearson test, check whether the phenotypic data contains only numeric values. If not, reject.
- for every single drug:
  1. drop NA for the current drug.
  2. intersect the header with the spreadsheet. If an intersection exists, add this drug to a common list; repeat until all drugs have been processed.
  3. check whether the common list returned by step 2 is empty. If it is empty, reject.
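
The binary-encoding steps for the t-test and edgeR measures can be sketched as below. Several details are assumptions of this sketch rather than documented behavior: missing values are represented as `None`, the value that sorts last becomes 1 in the two-value case, and new phenotype names take the hypothetical form `<name>_<value>`.

```python
def encode_phenotype(name, values):
    """Steps a-b: return {new_name: encoded_values} for one phenotype."""
    # a. Force string values to lowercase.
    values = [v.lower() if isinstance(v, str) else v for v in values]
    distinct = sorted({v for v in values if v is not None}, key=str)
    if len(distinct) < 2:
        return {}  # fewer than two distinct values: drop the phenotype
    if len(distinct) == 2:
        if set(distinct) == {0, 1}:
            return {name: values}  # already binary: leave unchanged
        one_value = distinct[1]  # assumption: the later-sorting value becomes 1
        encoded = [None if v is None else int(v == one_value) for v in values]
        return {f"{name}_{one_value}": encoded}
    # More than two distinct values: one indicator phenotype per value.
    return {
        f"{name}_{d}": [None if v is None else int(v == d) for v in values]
        for d in distinct
    }


def encode_all(phenotypes):
    """Steps c-d: drop sparse binary phenotypes, then require a survivor."""
    result = {}
    for name, values in phenotypes.items():
        for new_name, encoded in encode_phenotype(name, values).items():
            # c. Need at least two samples with 0 and two samples with 1.
            if encoded.count(0) >= 2 and encoded.count(1) >= 2:
                result[new_name] = encoded
    # d. Confirm at least one phenotype remains.
    if not result:
        raise ValueError("no phenotype remains after binary encoding")
    return result
```

In this sketch a phenotype `sex` with values `["M", "f", "m", "F", None]` becomes `sex_m` with values `[1, 0, 1, 0, None]`, and a three-level phenotype is expanded into indicators, of which only those with at least two 0s and two 1s are kept.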
After removing empty rows and columns in the user spreadsheet data, check:
- if the spreadsheet's input gene names contain NaN values, remove the corresponding rows.
- cast the index of the input genes dataframe to string type.
- retrieve the gene mapping status from the database and add a status column to the existing dataframe.
- if the dataframe from step 3 intersects with the universal genes list from the redis database, mark the intersected genes with value 1, else 0.
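
The steps above can be sketched without a database. In this illustration an in-memory dict and set stand in for the redis lookups, rows are plain tuples rather than a dataframe, and the function name is hypothetical; the `unmapped-none` label is the one shown in the output file example of this README.

```python
def pasted_gene_status(input_genes, mapping, universe):
    """Return (input_name, mapped_name, in_universe) tuples.

    `mapping` (input name -> Ensembl name) and `universe` (set of Ensembl
    names) are in-memory stand-ins for the database lookups.
    """
    rows = []
    for gene in input_genes:
        if gene is None:
            continue  # drop rows whose gene name is missing
        gene = str(gene)  # cast names to string type
        mapped = mapping.get(gene, "unmapped-none")
        rows.append((gene, mapped, 1 if mapped in universe else 0))
    return rows
```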
After removing empty rows and columns for user spreadsheet data, check :
- if spreadsheet contains NaN value/s, drop the corresponding columns.
- if spreadsheet contains only real, positive values, accept. If not, reject.
- if spreadsheet contains NaN value in gene name, remove corresponding rows.
- if spreadsheet contains NaN value in header, remove corresponding columns.
- if spreadsheet contains duplicate row names, remove duplicate rows.
- if spreadsheet contains duplicate column names, remove duplicate columns.
If the user provides phenotype data, after removing empty rows and columns, check:
- if phenotypic spreadsheet contains duplicate column name, remove duplicate column.
- if phenotypic spreadsheet contains duplicate row name, remove duplicate row.
- if phenotypic spreadsheet intersects with the genomic spreadsheet, accept. If not, reject.
After removing empty rows and columns for user spreadsheet data, check :
- if spreadsheet contains NaN value/s, drop the corresponding columns.
- if spreadsheet contains only positive, real value, accept. If not, reject.
- if spreadsheet contains duplicate row names, reject.
- if spreadsheet contains duplicate column names, reject.
- if spreadsheet contains at least two unique values per column, accept. If not, reject.
- map spreadsheet gene names to Ensembl names and generate mapping files.
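
Unlike the earlier sections, this set of checks rejects duplicates instead of removing them. A minimal sketch, assuming columns are given as `(column_name, values)` pairs so that duplicate names can be represented; the function names are hypothetical.

```python
def find_duplicates(names):
    """Return the names that occur more than once, in first-seen order."""
    seen = set()
    duplicates = []
    for name in names:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates


def check_names_and_variation(row_names, columns):
    """Return (ok, message); `columns` is a list of (column_name, values)."""
    if find_duplicates(row_names):
        return False, "rejected: duplicate row names"
    if find_duplicates([name for name, _ in columns]):
        return False, "rejected: duplicate column names"
    for name, values in columns:
        if len(set(values)) < 2:
            return False, f"rejected: column {name!r} has fewer than two unique values"
    return True, "accepted"
```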
After removing empty rows and columns for signature data, check :
- if signature data can be intersected with spreadsheet.
If the user provides network data, check:
- if the unique genes in the network data intersect with the signature data and the spreadsheet data.
After removing empty rows and columns in user spreadsheet data, check :
- based on the impute option the user selected:
  a. reject: reject the user spreadsheet if it contains any NA values.
  b. average: replace each NA value with the mean of its row.
  c. remove: drop any columns with missing values.
- if spreadsheet contains non-real values, reject.
- if correlation is edgeR and spreadsheet contains negative values, reject; otherwise, accept.
After removing empty rows and columns, check:
- If the correlation measure is t-test or edgeR:
  a. Force any string phenotypes to lowercase.
  b. Convert each phenotype to binary encoding. For each phenotype, let `num_distinct_values` be the number of distinct values, excluding NA, in the phenotype.
     - if `num_distinct_values` < 2, drop the phenotype.
     - if `num_distinct_values` == 2 and the two distinct values are 0 and 1, leave the phenotype unchanged.
     - if `num_distinct_values` == 2 and the two distinct values are not 0 and 1, replace all instances of one of the distinct values with 0 and replace all instances of the other distinct value with 1. Preserve any missing values. Edit the phenotype name to indicate which of the original values is now represented by 1.
     - if `num_distinct_values` > 2, expand the phenotype into `num_distinct_values` indicator phenotypes; any NAs will be preserved.
  c. For each of the binary phenotypes present at the end of step b, count the number of samples having value 0 and the number of samples having value 1. If either of those counts is less than 2, drop the phenotype.
  d. Confirm at least one phenotype remains.
- for the Pearson test, check whether the phenotypic data contains only numeric values. If not, reject.
- for every single phenotype:
  1. drop NA for the current phenotype.
  2. intersect the header with the spreadsheet. If an intersection exists, add this phenotype to a common list; repeat until all phenotypes have been processed.
  3. check whether the common list returned by step 2 is empty. If it is empty, reject.
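
The per-phenotype intersection loop can be sketched as follows, assuming each phenotype is a dict mapping sample name to value with `None` marking NA. The data layout and function name are assumptions for illustration.

```python
def phenotypes_with_shared_samples(phenotypes, spreadsheet_samples):
    """Return the phenotypes that share at least one non-NA sample with the
    spreadsheet header. `phenotypes` maps name -> {sample: value}."""
    samples = set(spreadsheet_samples)
    common = []
    for name, by_sample in phenotypes.items():
        # 1. Drop NA for the current phenotype.
        present = {s for s, v in by_sample.items() if v is not None}
        # 2. Keep the phenotype if it intersects the spreadsheet header.
        if present & samples:
            common.append(name)
    return common
```

An empty return value corresponds to step 3: no phenotype overlaps the spreadsheet, so the input would be rejected.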
After removing empty rows and columns in user spreadsheet data, check :
- if spreadsheet contains NaN value/s, drop the corresponding columns.
- if spreadsheet contains only real value, accept. If not, reject.
- if spreadsheet contains duplicate row names, remove duplicate rows.
- if spreadsheet contains duplicate column names, remove duplicate columns.
- map spreadsheet gene names to Ensembl names and generate mapping files.
After removing empty rows and columns in phenotype data, check :
- if phenotypic data intersects with spreadsheet on phenotype.
- for the Pearson test, if phenotypic data contains only real values or NaN.
- for every single drug:
  1. drop NA for the current drug.
  2. intersect the header with the spreadsheet. If an intersection exists, add this drug to a common list; repeat until all drugs have been processed.
  3. check whether the common list returned by step 2 is empty. If it is empty, reject.
After removing empty rows and columns in the user spreadsheet data, check:
- if expression_sample data contains only real values, accept. If not, reject.
- if expression_sample data's gene names can be mapped to Ensembl gene names, generate mapping files.
After removing empty rows and columns in the Pvalue gene phenotype data, check:
- if Pvalue_gene_phenotype data contains only real values, accept. If not, reject.
- if Pvalue_gene_phenotype's gene names can be mapped to Ensembl gene names, generate mapping files.
After removing empty rows and columns in the TF expression data, check:
- if TFexpression data contains only real values and doesn't contain NA, accept. If not, reject.
- if TFexpression data's gene names can be mapped to Ensembl gene names, generate mapping files.
git clone https://github.com/KnowEnG/Data_Cleanup_Pipeline.git
apt-get install -y python3-pip
apt-get install -y libblas-dev liblapack-dev libatlas-base-dev gfortran
pip3 install numpy
pip3 install pandas
pip3 install scipy==0.19.1
pip3 install scikit-learn==0.19.2
apt-get install -y libfreetype6-dev libxft-dev
pip3 install xmlrunner
pip3 install pyyaml
pip3 install knpackage
pip3 install redis
cd Data_Cleanup_Pipeline
cd test
make env_setup
Command | Option |
---|---|
make run_data_cleaning | example test with large dataset |
make run_samples_clustering_pipeline | samples clustering test |
make run_gene_prioritization_pipeline_pearson | pearson correlation test |
make run_gene_prioritization_pipeline_t_test | t-test correlation test |
make run_geneset_characterization_pipeline | geneset characterization test |
make run_general_clustering_pipeline | general clustering test |
make run_pasted_gene_list | pasted gene list test |
make run_phenotype_prediction_pipeline | phenotype prediction pipeline test |
make run_feature_prioritization_pipeline_pearson | feature prioritization test, Pearson correlation |
make run_feature_prioritization_pipeline_t_test_binary | feature prioritization test, t-test, binary case |
make run_feature_prioritization_pipeline_t_test_replace | feature prioritization test, t-test, replace case |
make run_feature_prioritization_pipeline_t_test_expand | feature prioritization test, t-test, expand case |
make run_feature_prioritization_pipeline_t_test_mixed | feature prioritization test, t-test, mixed case |
make run_feature_prioritization_pipeline_edgeR_binary | feature prioritization test, edgeR, binary case |
make run_feature_prioritization_pipeline_edgeR_replace | feature prioritization test, edgeR, replace case |
make run_feature_prioritization_pipeline_edgeR_expand | feature prioritization test, edgeR, expand case |
make run_feature_prioritization_pipeline_edgeR_mixed | feature prioritization test, edgeR, mixed case |
make run_signature_analysis_pipeline | signature analysis pipeline test |
make run_simplified_inpherno_pipeline | simplified_inpherno_pipeline test |
Follow steps 1-4 above then do the following:
mkdir run_directory
cd run_directory
mkdir results_directory
Look for examples of run_parameters in ./Data_Cleanup_Pipeline/data/run_files/TEMPLATE_data_cleanup.yml
Set the spreadsheet and drug_response (phenotype data) file names to point to your data.
- Update the PYTHONPATH environment variable
export PYTHONPATH='../src':$PYTHONPATH
- Run (these relative paths assume you are in the test directory with setup as described above)
python3 ../src/data_cleanup.py -run_directory ./run_dir -run_file TEMPLATE_data_cleanup.yml
Key | Value | Comments |
---|---|---|
pipeline_type | gene_prioritization_pipeline, ... | Choose pipeline cleaning type |
spreadsheet_name_full_path | directory+spreadsheet_name | Path and file name of user genomic spreadsheet |
phenotype_full_path | directory+phenotype_data_name | Path and file name of user phenotypic spreadsheet |
gg_network_name_full_path | directory+gg_network_name | Path and file name of user network |
results_directory | directory | Directory to save the output files |
redis_credential | host, password and port | Credential to access gene names lookup |
taxonid | 9606 | Taxon id of the genes |
source_hint | ' ' | Hint for looking up Ensembl names |
correlation_measure | t_test/pearson/edgeR* (*FP only) | Correlation measure for the gene/feature prioritization pipeline |
spreadsheet_name_full_path = TEST_1_gene_expression.tsv
phenotype_full_path = TEST_1_phenotype.tsv
- Output files
input_file_name_ETL.tsv. Input file after Extract Transform Load (cleaning)
input_file_name_MAP.tsv.
(translated gene) | (input gene name) |
---|---|
ENS00000012345 | abc_def_er |
... | ... |
ENS00000054321 | def_org_ifi |
input_file_name_UNMAPPED.tsv.
(input gene name) | (unmapped-none) |
---|---|
abcd_iffe | unmapped-none |
... | ... |
abdcefg_hijk | unmapped-none |
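
A downstream script could load the MAP file into a lookup table along these lines. This sketch assumes the file is tab-separated with two columns (translated gene, input gene name) and no header row; neither detail is confirmed above, and `read_gene_map` is a hypothetical helper.

```python
import csv
import io


def read_gene_map(handle):
    """Read (translated gene, input gene name) rows into {input_name: translated}."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        translated, input_name = row[0], row[1]
        mapping[input_name] = translated
    return mapping


# Example using the rows shown in the table above, in place of a real file.
sample = "ENS00000012345\tabc_def_er\nENS00000054321\tdef_org_ifi\n"
gene_map = read_gene_map(io.StringIO(sample))
```

To read a real output file, pass an open file handle, for example `open("input_file_name_MAP.tsv", newline="")`, instead of the `io.StringIO` object.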