Source code to reproduce the results described in the article "Performance of tumour microenvironment deconvolution methods in breast cancer using single-cell simulated bulk mixtures", submitted to Nature Communications on 31 July 2022. DOI:
- System requirements
- Experiment structure
- Prepare training and test data
- Method execution instructions
We generated the results for this study using Python/3.6.1, R/3.5.0, and R/4.0.2 on CentOS Linux 7 (Core). Below are the dependencies for the Python and R environments.
- cuda 10.1
- scaden 1.1.2
- pandas 1.1.5
- numpy 1.19.5
- anndata 0.6.22.post1
- scanpy 1.4.4.post1
- gtfparse 1.2.0
- plotly 4.1.0
- smote-variants 0.4.0
- blackcellmagic 0.0.2
The method DWLS requires the R/3.5.0 environment, while all other R-based methods require the R/4.0.2 environment.
- quadprog 1.5.5
- reshape 0.8.8
- e1071 1.6.8
- Seurat 2.3.3
- ROCR 1.0.7
- varhandle 2.0.5
- MAST 1.6.0
- BisqueRNA 1.0.5
- Biobase 2.50.0
- TED 1.4
- scBio 1.4
- Seurat 4.0.3
- foreach 1.5.1
- doSNOW 1.0.19
- parallel 4.0.2
- raster 3.4.13
- fields 12.5
- limma 3.46.0
- LiblineaR 2.10.12
- sp 1.4.5
- EPIC 1.1.5
- jsonlite 1.7.2
- hspe 0.1
- MuSiC 0.2.0
- xbioc 1.7.2
The method CIBERSORTx does not provide source code and can only be executed via a Docker container. As Docker requires root access, which was not available on our infrastructure, we converted the CIBERSORTx Docker image to a Singularity image. This requires:
- Docker 20.10.17
- Singularity 3.6.1
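The conversion itself can be done with Singularity's Docker support. A minimal sketch, assuming the CIBERSORTx fractions image is available on Docker Hub as cibersortx/fractions (or has already been pulled into a local Docker daemon):
# Build a Singularity image directly from the Docker Hub image.
singularity build fractions_latest.sif docker://cibersortx/fractions:latest
# Or convert an image already present in the local Docker daemon:
singularity build fractions_latest.sif docker-daemon://cibersortx/fractions:latest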
There are 5 experiments with the overall directory structure as follows:
.
├── 01_purity_levels_experiment/
│   ├── include_normal_epithelial/
│   │   └── data/
│   └── exclude_normal_epithelial/
│       └── data/
├── 02_normal_epithelial_lineages_experiment/
│   └── data/
├── 03_immune_lineages_experiment/
│   ├── minor_level/
│   │   └── data/
│   └── subset_level/
│       └── data/
├── params/
├── src/
├── .gitignore
└── README.md
Each experiment is self-contained within its own directory, with its own data and execution files to run each deconvolution method either as a Jupyter notebook or a .pbs job.
Each experiment needs a data/ directory, which you can download from the matching folder on Google Drive. Each data/ folder should contain:
data/
├── Whole_miniatlas_meta_*.csv
├── Whole_miniatlas_umap.coords.tsv
├── training/
│   ├── scRNA_ref.h5ad (single-cell reference profiles)
│   └── training_sim_mixts.h5ad (training simulated RNA-Seq data, only for Scaden)
├── test/
│   └── test_sim_mixts.h5ad (test simulated bulk RNA-seq data)
├── pretrained_scaden/ (pretrained Scaden models. Please refer to the "Scaden" section)
└── expect_results/ (expected results. Please refer to the "Collect test results" section)
After downloading, please place each data/ folder in its corresponding experiment directory.
Each deconvolution method requires either single-cell reference profiles or simulated mixtures as training data to estimate cell populations in the test_sim_mixts.h5ad files. To prepare this data, simply run the 0_prepare_method_specific_data.ipynb notebook in each experiment directory. Once it finishes, method-specific sub-folders will be created in the data/ directory:
data/
├── bisque/
├── bprism/
├── cbx/
├── cpm/
├── dwls/
├── epic/
├── hspe/
├── music/
├── scaden/
└── ...
NOTE: You will need at least 4 CPUs and 275 GiB of memory to run this step for each experiment.
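To run the preparation notebook non-interactively (for example, inside a batch job), a minimal sketch using jupyter nbconvert, assuming Jupyter is available in the active Python environment:
# Execute the preparation notebook in place from an experiment directory.
cd 01_purity_levels_experiment/include_normal_epithelial
jupyter nbconvert --to notebook --execute --inplace 0_prepare_method_specific_data.ipynb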
Except for CIBERSORTx and Scaden, the .pbs run scripts of all methods point to the R code in src/. CIBERSORTx execution is based on the Singularity container, and Scaden execution is based on the bash command scaden (please refer to System requirements). The CPU and memory requirements for each job are already specified in the corresponding .pbs files.
NOTE: each method can produce different sets of results depending on the specified parameters. For example, BayesPrism lets users optionally input marker genes together with cell states (cell subtypes), or marker genes only. Each of these options produces a different set of results, meaning the results folder in the .pbs script for each method needs to be adjusted accordingly.
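For illustration, such an adjustment might look like the following inside the .pbs script (the RESULTS_DIR variable name is hypothetical; use whatever path variable the script actually defines):
# Hypothetical variable name; check the actual .pbs script.
# Point the output at a configuration-specific folder before submitting.
RESULTS_DIR="${WORK_DIR}/results/bprism_markers_and_states"
# RESULTS_DIR="${WORK_DIR}/results/bprism_markers_only"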
Bisque
Data files:
- logged_scRNA_ref.csv: Bisque requires the single-cell reference data to be log-transformed
- phenotypes.csv: cell labels. Matches the order in which cells are listed in logged_scRNA_ref.csv
Running:
- Replace the user email in the PBS -M option
- Modify the WORK_DIR and SRC_DIR paths in the 1_run_bisque.pbs file
- Run the method as a PBS job (see the sketch after this list)
BayesPrism
Data files:
- scRNA_ref.csv: single-cell reference profiles. Columns are cell ids and rows are gene symbols
- single_cell_labels.csv: single-cell labels. Matches the order in which cells are listed in scRNA_ref.csv
Running:
- Replace the user email in the PBS -M option
- Modify the WORK_DIR and SRC_DIR paths in the 1_run_bprism.pbs file
- (Optional) Increase the number of available CPUs in the PBS -l option and the N_CPUS variable
- Run the method as a PBS job
CIBERSORTx
Data files:
- scRNA_ref.txt: the single-cell reference matrix. Columns are cells and column names are cell types. CBX infers cell types from column names and therefore does not need a single-cell labels file like the other methods
- test_counts_*_pur_lvl.txt: as mentioned above, CBX requires the bulk expressions to be in its own directory
Running:
- Replace the user email in the PBS -M option
- Modify WORK_DIR in the 1_run_cbx.pbs file
- Copy the test_counts_<PUR_LVL>_pur_lvl.txt files from test/ to cbx/
- Download the fractions_latest.sif file from Google Drive. This is the same container provided by CIBERSORTx; we simply converted the image from Docker to Singularity, as Docker requires sudo access and Singularity does not
- Update the Singularity commands (see the sketch after this list):
  - the path to the Singularity image fractions_latest.sif
  - --token with the token requested from cibersortx.stanford.edu
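A sketch of the resulting Singularity call (bind paths, username, and token are placeholders; keep the flags that are already present in 1_run_cbx.pbs):
# Placeholder paths and credentials; CIBERSORTx expects its input under
# /src/data and writes results to /src/outdir inside the container.
singularity run \
    -B "${WORK_DIR}/data/cbx:/src/data" \
    -B "${WORK_DIR}/data/cbx/results:/src/outdir" \
    fractions_latest.sif \
    --username your.email@example.com \
    --token YOUR_TOKEN \
    --single_cell TRUE \
    --refsample scRNA_ref.txt \
    --mixture test_counts_<PUR_LVL>_pur_lvl.txt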
CPM
Data:
We slightly modified the CPM source code to split the execution in half. The first half uses 48 CPUs and 300-500 GiB of memory; the second half only uses 4 CPUs and 100 GiB of memory.
- scRNA_ref_1330_per_ctype_*.txt: single-cell reference data. We only use 1,330 cells or fewer per cell type
- cell_state.csv: the cell-state space that CPM uses to map how cells transition from one cell type to another. We use UMAP coordinates as the cell-state space
- single_cell_label.csv: cell labels. Matches the order in which cells are listed in scRNA_ref_1330_per_ctype_*.txt
Running:
- Replace the user email in the PBS -M option
- Modify the WORK_DIR and SRC_DIR paths in the 1_run_cpm_1.pbs and 1_run_cpm_2.pbs files
- Update the path to the 1_CPM_functions.R file in src/1_run_cpm_1.R and src/1_run_cpm_2.R
- Run 1_run_cpm_1.pbs first as a PBS job
- Wait for 1_run_cpm_1.pbs to finish, then run 1_run_cpm_2.pbs as a PBS job (or chain the two stages as sketched after this list)
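Rather than waiting manually, the two stages can be chained through a scheduler dependency (a sketch assuming PBS/Torque-style qsub; dependency syntax varies between schedulers):
# Submit stage 1 and capture its job id.
JOB1=$(qsub 1_run_cpm_1.pbs)
# Stage 2 starts only after stage 1 completes successfully.
qsub -W depend=afterok:"$JOB1" 1_run_cpm_2.pbs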
DWLS
Data:
- scRNA_ref.txt: the single-cell reference matrix
- single_cell_labels.csv: cell labels. Matches the order in which cells are listed in the single-cell reference
Running:
- Replace the user email in the PBS -M option
- Modify the WORK_DIR and SRC_DIR paths in the 1_run_dwls_1.pbs and 1_run_dwls_2.pbs files
- Run the method as PBS jobs. We restructured the original DWLS into 2 stages, 1_run_dwls_1.pbs and 1_run_dwls_2.pbs, using the src files 1_run_dwls_1.R and 1_run_dwls_2.R respectively. 1_run_dwls_1.pbs runs differential expression analysis using either seurat or mast and generates cell-type-specific .RData files. 1_run_dwls_2.R collects the differentially expressed genes produced in the previous step and conducts the deconvolution.
We also provide the pbs script 1_run_dwls_all.pbs as an alternative, should any users prefer to run DWLS in a single, non-staged job.
EPIC
Data:
EPIC is the only method that requires a pre-built signature matrix instead of a single-cell reference. We provided EPIC with the signature matrix that CBX generated during its execution, which is why the data files for EPIC are in the cbx_sig_matrix folder.
- reference_profiles.csv: the signature matrix that CBX produced
- marker_gene_symbols.csv: marker genes among all genes in the signature matrix. EPIC weights these genes higher than non-marker genes. In our study, this is merely a procedural necessity, as we list all genes in the signature matrix as marker genes.
Running:
- Replace the user email in the PBS -M option
- Modify the WORK_DIR and SRC_DIR paths in the 1_run_epic.pbs file
- Run the method as a PBS job
hspe
Data:
- scRNA_ref.csv: the single-cell reference matrix
- pure_samples.json: cell labels in a dictionary. Matches the order in which cells are listed in the single-cell reference
- logged_test_counts_*_pur_lvl.txt: as mentioned above, hspe requires the bulk expressions to be log-transformed
Running:
- Replace the user email in the PBS -M option
- Modify the WORK_DIR and SRC_DIR paths in the 1_run_hspe.pbs file
- Run the method as a PBS job
MuSiC
Data:
- scRNA_ref.csv: the single-cell reference matrix
- pheno.csv: cell labels and patient ids of the cells in the single-cell reference. Matches the order in which cells are listed in the single-cell reference
Running:
- Replace the user email in the PBS -M option
- Modify the WORK_DIR and SRC_DIR paths in the 1_run_music.pbs file
- Run the method as a PBS job
Scaden
Data:
- train_counts.h5ad: Scaden is the only method that requires simulated bulk mixtures as training data. It also requires these mixtures to be compressed into an AnnData (.h5ad) object
- test_counts_*_pur_lvl.txt: as mentioned above, Scaden requires the bulk expressions to be in its own directory
Running:
- Replace the user email in the PBS -M option
- Modify WORK_DIR in the 1_run_scaden.pbs file
- Copy the test_counts_<PUR_LVL>_pur_lvl.txt files from the test/ folder into the scaden/ folder
NOTE 1: Scaden is built using the tensorflow library and requires cuda 10.1 to run on GPUs. In this study, we installed scaden and cuda 10.1 in 2 separate environments, reflected in the .pbs script as:
source /software/scaden/scaden-0.9.4-venv/bin/activate
module load gpu/cuda/10.1
Without the CUDA 10.1 driver, Scaden will fall back to using CPUs instead of GPUs, which takes significantly longer.
NOTE 2: The Scaden method is a neural network and therefore produces slightly different results when trained from scratch. However, results from a newly trained model should be similar to results generated by the pre-trained models.
For testing the Scaden pre-trained models:
- Edit the 01_run_scaden.pbs script to point to data/pretrained_scaden/ instead of data/scaden/
- Copy the test_counts_<PUR_LVL>_pur_lvl.txt files from the test/ folder into the pretrained_scaden/ folder
- Comment out scaden train in 01_run_scaden.pbs. scaden process should produce processed data files identical to those in data/scaden
- Run 01_run_scaden.pbs (see the sketch after this list)
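For reference, the Scaden side of the .pbs script follows the standard three-step scaden CLI (a sketch; the exact flags and file names in 01_run_scaden.pbs may differ):
# Restrict the training mixtures to the genes present in the test data;
# writes processed.h5ad by default.
scaden process train_counts.h5ad test_counts_<PUR_LVL>_pur_lvl.txt
# Train the model ensemble (commented out when using pre-trained models).
scaden train processed.h5ad --steps 5000 --model_dir model
# Predict cell-type fractions for the test mixtures.
scaden predict --model_dir model test_counts_<PUR_LVL>_pur_lvl.txt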
Once all methods have been executed, you can collect the results using the 2_collate_test_results.ipynb Jupyter notebook in each experiment directory. Simply update the prefix variable at the top of each notebook with the correct path to the experiment folder and run the notebook from top to bottom. After that, all methods' results will be collated into the data/results folder.
In addition, you can find all the results from this study in data/expect_results/. The expect_results/ sub-folder of each experiment should contain:
data/
└── expect_results/
    ├── bisque.csv
    ├── bprism.csv
    ├── cbx.csv
    ├── cpm.csv
    ├── dwls.csv
    ├── epic.csv
    ├── hspe.csv
    ├── music.csv
    ├── scaden.csv
    └── truth.csv
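To sanity-check a finished run, the collated outputs in data/results/ can be compared against the expected results (a sketch assuming both folders hold identically named CSV files):
# Report which collated result files differ from the expected ones.
# scaden.csv may differ slightly when trained from scratch (see NOTE 2).
for f in data/results/*.csv; do
    diff -q "$f" "data/expect_results/$(basename "$f")" || true
done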