A Python implementation of a part of compareDEtools.
comPyDEtools can ...
- Generate simulated datasets (KIRC, Bottomly, mKdB or mBdK)
- Run DE analysis (using `subprocess.run()`)
- Generate figures like Fig 2 in Baik 2020
and can't ...
- SEQC benchmark (like Fig 1 in Baik 2020)
- False positive count comparison (like Fig 3 in Baik 2020)
- etc
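
The DE analysis itself is delegated to external commands through `subprocess.run()`. A minimal sketch of that pattern follows, assuming each command is a single string that can be split into arguments without shell features. `run_each` is a hypothetical helper written for illustration; it is not the package's own `run_commands`.

```python
import shlex
import subprocess
import sys
from typing import Iterator


def run_each(cmds: list[str]) -> Iterator[subprocess.CompletedProcess]:
    """Run each command in order and yield its completed-process record."""
    for cmd in cmds:
        # shlex.split turns the command string into an argument list (POSIX rules).
        # check=False leaves it to the caller to inspect the return code.
        yield subprocess.run(
            shlex.split(cmd), capture_output=True, text=True, check=False
        )


# A stand-in for a real DE analysis command (for example, an Rscript call).
demo_cmds = [f'"{sys.executable}" -c "print(42)"']
for res in run_each(demo_cmds):
    print(res.returncode, res.stdout.strip())
```

Yielding one result per command lets the caller log or stop on a failure as soon as it happens instead of waiting for the whole list.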
pip install git+https://github.com/136s/comPyDEtools.git
1. Make a condition file like `compydetools/data/synthetic_conditions.yaml`.
2. Run `python -m compydetools condition.yaml` (specify your condition file made at step 1), or run in Python:

       from compydetools.condition import CONDITION, set_condition
       from compydetools.core import Paper
       from compydetools.utils import run_commands

       set_condition("condition.yaml")  # specify your condition file made at step 1
       paper = Paper(nrep=CONDITION.nrep)
       paper.generate_datasets()
       for anal_res in run_commands(CONDITION.analysis.cmds):
           print(anal_res)
       paper.make()

3. Check generated files
   - `input/`: simulated RNA-seq data
     - dataset structure
       - first line is header
       - `Gene_ID` column: sequential numbers from 1 to the number of genes
       - `Gene_Symbol` column: "LOC" + `Gene_ID`
       - `Description` column: "up" (upregulated), "dn" (downregulated) or "ns" (not significant)
       - remaining columns: simulated expression counts for each sample; sample names are "TRT-*" (treatment sample) or "CTRL-*" (control sample), where * is a sequential number within each condition
     - dataset property
       - file path: `{simul_data}_{disp_type}_upFrac{frac_up}_{nsample}spc_{outlier_mode}_{nde}DE/{simul_data}_{disp_type}_upFrac{frac_up}_{nsample}spc_{outlier_mode}_{nde}DE_rep{seed}.tsv`
       - newline character: LF
       - encoding: UTF-8
   - `result/`: plots of performance comparison
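
The dataset layout described above can be read with the standard library alone. The sketch below parses a tiny hand-written example (the counts are made up), splits the sample columns by their "TRT-" and "CTRL-" prefixes, and fills in the file path template. `read_dataset` and `dataset_path` are hypothetical helpers, and how the package formats numbers such as `frac_up` inside the path is an assumption.

```python
import csv
import io

# A tiny hand-written dataset in the documented layout (values are made up).
EXAMPLE_TSV = (
    "Gene_ID\tGene_Symbol\tDescription\tTRT-1\tTRT-2\tCTRL-1\tCTRL-2\n"
    "1\tLOC1\tup\t120\t135\t40\t38\n"
    "2\tLOC2\tns\t15\t17\t16\t14\n"
    "3\tLOC3\tdn\t5\t7\t60\t66\n"
)


def read_dataset(handle):
    """Return (rows, treatment_columns, control_columns) from a TSV handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = list(reader)
    columns = reader.fieldnames or []
    trt = [c for c in columns if c.startswith("TRT-")]
    ctrl = [c for c in columns if c.startswith("CTRL-")]
    return rows, trt, ctrl


def dataset_path(simul_data, disp_type, frac_up, nsample, outlier_mode, nde, seed):
    """Fill in the documented path template (number formatting is assumed)."""
    stem = (
        f"{simul_data}_{disp_type}_upFrac{frac_up}"
        f"_{nsample}spc_{outlier_mode}_{nde}DE"
    )
    return f"{stem}/{stem}_rep{seed}.tsv"


rows, trt_cols, ctrl_cols = read_dataset(io.StringIO(EXAMPLE_TSV))
true_labels = {row["Gene_Symbol"]: row["Description"] for row in rows}
print(trt_cols, ctrl_cols)
print(true_labels)
print(dataset_path("KIRC", "same", 0.5, 3, "D", 5, 1))
```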
- `analysis`: configuration of DE analysis
  - `cmds`: a list of DE analysis commands
  - `res`: a regular expression for the path to result files
    - "{count_stem}" is replaced by the dataset path stem
    - "{method_type}" is replaced by `method_type`
  - `de_true`: column name of DEG regulation (up, dn or ns) in each result file (defaults to "Description")
  - `de_score`: column name of the DEG score, such as a p-value, in each result file (defaults to "padj")
  - `de_score_threshold`: threshold of `de_score` (a gene is called a DEG when its `de_score` is lower than `de_score_threshold`)
- `dirs`: directories of generated files
  - `dataset`: generated simulated datasets
  - `result`: plots of performance comparison, CSV of metrics values and pickle of the `Paper` instance
- `simul_data`: KIRC, Bottomly, mKdB or mBdK
- `disp_type`: same or different
- `frac_up`: fraction of upregulated genes among DEGs (float, $[0, 1]$)
- `nsample`: number of samples per group (int, $\geq 3$)
- `outlier_mode`: D, R, OS, or DL
- `pde`: percent of DE genes among all genes (float, $(0, 100]$)
- `metrics_type`: auc, tpr, fdr, cutoff, f1score or kappa
  - to add a metric, modify `const.Metrics` and `utils.calc_metrics()` in a fork or PR
- `method_type`: specify your DE analysis method (defaults to {"deseq2": "Deseq2"})
  - comPyDEtools recognizes the type of DE analysis method only by the output folder path (`analysis.res` in the condition file)
- `nrep`: number of simulation repetitions under one condition (int, $\geq 3$)
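
The documented ranges and the two `analysis.res` placeholders can be checked before a long run. The sketch below assumes the condition file has already been loaded into a plain dict and that each key holds either one value or a list of values; the exact schema of the YAML file is not shown here, so treat the dict layout as an assumption. `validate_condition` and `fill_result_pattern` are hypothetical helpers, not part of the package.

```python
def as_list(value):
    """Wrap a single value in a list so scalars and lists are handled alike."""
    return value if isinstance(value, list) else [value]


def validate_condition(cond: dict) -> list[str]:
    """Return the violated range rules; an empty list means every check passed."""
    problems = []
    if any(not 0 <= v <= 1 for v in as_list(cond["frac_up"])):
        problems.append("frac_up must be in [0, 1]")
    if any(v < 3 for v in as_list(cond["nsample"])):
        problems.append("nsample must be >= 3")
    if any(not 0 < v <= 100 for v in as_list(cond["pde"])):
        problems.append("pde must be in (0, 100]")
    if cond["nrep"] < 3:
        problems.append("nrep must be >= 3")
    return problems


def fill_result_pattern(res: str, count_stem: str, method_type: str) -> str:
    """Substitute the two placeholders that `analysis.res` may contain."""
    return res.replace("{count_stem}", count_stem).replace(
        "{method_type}", method_type
    )


good = {"frac_up": [0.5, 0.7], "nsample": 3, "pde": [5, 10], "nrep": 3}
bad = {"frac_up": 1.2, "nsample": 2, "pde": 0, "nrep": 3}
print(validate_condition(good))
print(validate_condition(bad))
# The pattern below is a made-up example, not a path the package prescribes.
print(fill_result_pattern("output/{method_type}/{count_stem}.tsv", "KIRC_rep1", "deseq2"))
```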
erDiagram
Paper |o--|{ Figure : "has a list of"
Figure |o--|{ Plot : "has a list of"
Plot ||--|{ DataPool : "has a list of"
DataPool ||--|{ Dataset : "has a list of"
DataPool ||--|{ Result : "has a list of"
Dataset ||--|| Result : ""
Paper {
int nrep "number of repetition in a data pool (3<=)"
int seed "global random seed"
list[Figure] figures
}
Figure {
Simul simul_data PK "simulation data (KIRC, Bottomly, mKdB or mBdK)"
Disp disp_type PK "dispersion type (same or different)"
float frac_up PK "fraction upregulated ([0, 1])"
list[Plot] plots
}
Plot {
int nsample PK "number of samples per condition (3<=)"
Outlier outlier_mode PK "outlier mode (D, R, OS, or DL)"
list[DataPool] datapools
}
DataPool {
float pde PK "percent of DE in all genes ((0, 100])"
list[Dataset] datasets
list[DataPool] datapools
}
Dataset {
int seed PK "random seed for each dataset generated from global seed"
DataFrame counts "simulated count matrix"
}
Result {
int seed PK "random seed for each dataset"
list[Method] method_types "a list of DE analysis methods to be compared"
list[Metrics] metrics_types "a list of metrics to compare DE analysis methods"
}
- `Paper` class represents all figures in the condition file
- `Figure` class represents a figure (like Fig 2)
- `Plot` class represents a subfigure (like Fig 2A)
- `DataPool` class represents datasets generated under the same condition (contains `nrep` datasets)
- `Dataset` class represents a simulated count matrix
- `Result` class represents the results of a `Dataset` under each method and metric
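
The containment hierarchy described above can be sketched with plain dataclasses. The classes below are illustrative stand-ins that reuse the documented names and key fields; they are not the package's definitions, `Result` is left out for brevity, and deriving dataset seeds by simple counting is an assumption.

```python
from dataclasses import dataclass, field

# Illustrative stand-ins only; not the definitions used by the package.


@dataclass
class Dataset:
    seed: int  # random seed for this dataset


@dataclass
class DataPool:
    pde: float  # percent of DE genes among all genes
    datasets: list = field(default_factory=list)


@dataclass
class Plot:
    nsample: int
    outlier_mode: str
    datapools: list = field(default_factory=list)


@dataclass
class Figure:
    simul_data: str
    disp_type: str
    frac_up: float
    plots: list = field(default_factory=list)


@dataclass
class Paper:
    nrep: int
    seed: int
    figures: list = field(default_factory=list)

    def n_datasets(self) -> int:
        """Count every dataset reachable from this paper."""
        return sum(
            len(pool.datasets)
            for fig in self.figures
            for plot in fig.plots
            for pool in plot.datapools
        )


def build_pool(pde: float, nrep: int, first_seed: int) -> DataPool:
    """Create a pool holding nrep datasets with consecutive seeds (assumed scheme)."""
    return DataPool(pde, [Dataset(first_seed + i) for i in range(nrep)])


paper = Paper(nrep=3, seed=0)
plot = Plot(
    nsample=3,
    outlier_mode="D",
    datapools=[build_pool(5.0, paper.nrep, 0), build_pool(10.0, paper.nrep, 3)],
)
paper.figures.append(Figure("KIRC", "same", 0.5, plots=[plot]))
print(paper.n_datasets())  # 2 pools x 3 repetitions = 6
```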
| property / Class | `Paper` | `Figure` | `Plot` | `DataPool` | `Dataset` | `Result` |
|---|---|---|---|---|---|---|
| a list of | `Figure` | `Plot` | `DataPool` | `Dataset`, `Result` | | |
| number of repetition (`nrep`) | 1 | 1 | 1 | 1 | | |
| simulation data (`simul_data`) | | 1 | 1 | 1 | 1 | 1 |
| dispersion type (`disp_type`) | | 1 | 1 | 1 | 1 | 1 |
| fraction upregulated (`frac_up`) | | 1 | 1 | 1 | 1 | 1 |
| number of samples (`nsample`) | | | 1 | 1 | 1 | 1 |
| outlier mode (`outlier_mode`) | | | 1 | 1 | 1 | 1 |
| percent of DE in all genes (`pde`) | | | | 1 | 1 | 1 |
| simulated count matrix | | | | | 1 | 1 |
| method type (`method_type`) | | | | | | * |
| metrics type (`metrics_type`) | | | | | | * |
Table: Class / property correspondence (*: many)
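
To see how the condition values fan out into objects, the counts at each level can be computed with `itertools.product`. This is a sketch under the assumption that each condition key may hold a list of values and that every combination is generated; the example values are made up.

```python
from itertools import product

# Made-up example condition; list-valued keys are an assumption.
condition = {
    "simul_data": ["KIRC", "Bottomly"],
    "disp_type": ["same"],
    "frac_up": [0.5, 0.7],
    "nsample": [3, 10],
    "outlier_mode": ["D", "R"],
    "pde": [5, 10, 30],
    "nrep": 3,
}

# One figure per (simul_data, disp_type, frac_up) combination.
figure_keys = list(
    product(condition["simul_data"], condition["disp_type"], condition["frac_up"])
)
# One plot per (nsample, outlier_mode) combination inside each figure.
plot_keys = list(product(condition["nsample"], condition["outlier_mode"]))

n_figures = len(figure_keys)
n_plots = n_figures * len(plot_keys)
n_pools = n_plots * len(condition["pde"])  # one data pool per pde value
n_datasets = n_pools * condition["nrep"]   # nrep datasets per data pool

print(n_figures, n_plots, n_pools, n_datasets)
```

With these example values the run would produce 4 figures, 16 plots, 48 data pools and 144 datasets, which is a quick way to estimate run time before launching the analysis commands.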
- `Simul` class is a list of simulation dataset names (`simul_data` in the condition file)
- `Disp` class is a list of dispersion conditions (`disp_type` in the condition file)
- `Outlier` class is a list of outlier modes (`outlier_mode` in the condition file)
- `Metrics` class is a list of metrics for performance comparison (`metrics_type` in the condition file)
- `Method` class is a list of DE analysis methods (`method_type` in the condition file)
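
As an illustration of how an enumerated metric list and the `de_score_threshold` rule fit together, the sketch below computes three of the listed metrics from true labels and scores. `MetricKind` and `calc_basic_metrics` are hypothetical names; this is not the package's `const.Metrics` or `utils.calc_metrics()`, and missing scores are not handled.

```python
from enum import Enum


class MetricKind(Enum):
    """Hypothetical subset of the metric names listed for metrics_type."""
    TPR = "tpr"
    FDR = "fdr"
    F1SCORE = "f1score"


def calc_basic_metrics(de_true, de_score, threshold):
    """Compute TPR, FDR and F1 from true labels and per-gene scores.

    de_true: "up", "dn" or "ns" per gene; de_score: e.g. adjusted p-values.
    A gene is called a DEG when its score is strictly below the threshold.
    """
    called = [score < threshold for score in de_score]
    truth = [label in ("up", "dn") for label in de_true]
    tp = sum(c and t for c, t in zip(called, truth))
    fp = sum(c and not t for c, t in zip(called, truth))
    fn = sum(t and not c for c, t in zip(called, truth))
    return {
        MetricKind.TPR: tp / (tp + fn) if tp + fn else 0.0,
        MetricKind.FDR: fp / (tp + fp) if tp + fp else 0.0,
        MetricKind.F1SCORE: 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
    }


labels = ["up", "dn", "ns", "ns", "up"]
scores = [0.01, 0.20, 0.03, 0.50, 0.04]
for kind, value in calc_basic_metrics(labels, scores, threshold=0.05).items():
    print(kind.value, round(value, 3))
```

Using an `Enum` keeps the set of valid metric names in one place, so a typo in a condition file can be rejected with `MetricKind("...")` raising `ValueError`.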
This is a partial port of unistbig/compareDEtools.