Deconvoluting doublets from single-cell RNA-sequencing data
See our Cell Reports paper for more information on DoubletDecon. Also see our bioRxiv for an older description of the algorithm.
- NEW! Integrated ICGS2_to_ICGS1() is now available to support input files from ICGS version 2. You should not have to make any changes in your DoubletDecon workflow to use ICGS version 2 instead of ICGS version 1
- Improved_Seurat_Pre_Process() now loads dplyr at the beginning of the fuction (thanks chansigit for the feedback!)
- Fixed bugs in Remove_Cell_Cycle() that was preventing it from running on certain datasets
- NEW! Improved_Seurat_Pre_Process() is now available to replace Seurat_Pre_Process() for those who would prefer to work directly with a Seurat Object as input instead of individual files saved from a Seurat workflow. Workflows following the protocol found at https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html, from the provided script (seurat-3.0.R), or similar will be sufficent for this new function.
- Resolved compatibility issues with Seurat version 3
- Change default for only50 to FALSE (from TRUE) to reflect best practices in running DoubletDecon
- NEW! DoubletDecon UI available in the GitHub repository (requires R3.5.0 or later and RStudio with 'shiny' package installed)
- NEW! Improved Rescue step, Pseudo_Marker_Finder, now uses parallel processing and data chunking to improve speed and memory efficency. Results remain the same with the exception of no longer saving p-values (future release)
- Remove downsample and sample_num
- Upped num_doubs default value to 100 (from 30)
- Require 'tidyr', 'R.utils', 'forrach', 'doParallel', 'stringr'. No longer require 'hopach'
- Changed log_name_file to the value for filename, for compatibility with Windows operating systems
- Automatically write processed data and groups files from Clean_Up_Data (even if write=FALSE) for use with new Rescue step (Pseudo_Marker_Finder)
- Finalize switch to more granular doublet calls in the case of no Rescue step
- Change name of cluster merging plot to "Cluster Merge" (from "Blacklist")
- Fixed bug in Remove step when number of clusters equals 2 (Euclidean distance is used in the place of Pearson correlation)
- General bug fixes affecting final groups file and final expression file output.
- Added user option to specify minimum number of unique genes to Rescue a putative doublet cluster (previously set at 4).
- Speed up run time for users who do not use the Rescue step (PMF=FALSE).
- Remove requirement for 'as.color' function.
- Additional "Remove" step option to create synthetic doublet centroids with 30%/70% and 70%/30% parent cell contribution instead of simply 50%/50% (only50=FALSE).
- Heatmap generation corrected for large datasets (>5000 cells).
- "Rescue" step modification from t-tests for all clusters to ANOVA with Tukey post-hoc test in only putative doublet clusters. Minimum of 4 unique genes as hardcoded default.
- "Rescue" step now allows for sampling of clusters evenly or proportional to cluster size when using the full expression matrix.
- Hopach removed as a "Recluster" option; does not work with improved "Rescue" step. Subsequently removed the DeconCalledFreq table as written and returned output.
- Log file is generated with a unique ID for each run of DoubletDecon.
- Catch error in mcl() function and quits DoubletDecon with warning to choose a different rhop value.
- Synthetic doublet deconvolution values output for quality control (Synth_doublet_info)
Run the following code to install the package using devtools:
if(!require(devtools)){
install.packages("devtools") # If not already installed
}
devtools::install_github('EDePasquale/DoubletDecon')
DoubletDecon requires the following R packages:
- DeconRNASeq
- gplots
- dplyr
- MCL
- clusterProfiler
- mygene
- tidyr
- R.utils
- foreach
- doParallel
- stringr
- Seurat (for Improved_Seurat_Pre_Process only)
These can be installed with:
source("https://bioconductor.org/biocLite.R")
biocLite(c("DeconRNASeq", "clusterProfiler", "hopach", "mygene", "tidyr", "R.utils", "foreach", "doParallel", "stringr"))
install.packages("MCL")
Additionally, the use of the cell cycle removal option requires an internet connection.
Improved_Seurat_Pre_Process(seuratObject, num_genes=50, write_files=FALSE)
- seuratObject Seurat object following a protocol such as https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html
- num_genes Number of genes for the top_n function. Default is 50.
- write_files Save the output files to .txt format. Defauly is FALSE.
- newExpressionFile - Seurat expression file in ICGS format (ICGS genes)
- newFullExpressionFile - Seurat expression file in ICGS format (all genes)
- newGroupsFile - Groups file ICGS format
- expressionFile: Normalized expression matrix or counts file as a .txt file (expression from Seurat's NormalizeData() function)
- genesFile: Top marker gene list as a .txt file from Seurat's top_n() function
- clustersFile: Cluster identities as a .txt file from Seurat object @ident
- newExpressionFile - Seurat expression file in ICGS format (used as 'rawDataFile')
- newGroupsFile - Groups file ICGS format (used as 'groupsFile')
Main_Doublet_Decon(rawDataFile, groupsFile, filename, location,
fullDataFile = NULL, removeCC = FALSE, species = "mmu", rhop = 1,
write = TRUE, PMF = TRUE, useFull = FALSE, heatmap = TRUE, centroids=FALSE, num_doubs=100,
only50=FALSE, min_uniq=4)
- rawDataFile: Name of file containing ICGS or Seurat expression data (gene by cell)
- groupsFile: Name of file containing group assignments (3 column: cell, group(numeric), group(numeric or character))
- filename: Unique filename to be incorporated into the names of outputs from the functions.
- location: Directory where output should be stored
- fullDataFile: Name of file containing full expression data (gene by cell). Default is NULL.
- removeCC: Remove cell cycle gene cluster by KEGG enrichment. Default is FALSE.
- species: Species as scientific species name, KEGG ID, three letter species abbreviation, or NCBI ID. Default is "mmu".
- rhop: x in mean+x*SD to determine upper cutoff for correlation in the blacklist. Default is 1.
- write: Write output files as .txt files. Default is TRUE.
- recluster: Recluster deconvolution classified doublets and non-doublets seperately using hopach or deconvolution classifications.
- PMF: Use step 3 (unique gene expression) in doublet determination criteria. Default is TRUE.
- useFull: Use full gene list for PMF analysis. Requires fullDataFile. Default is FALSE.
- heatmap: Boolean value for whether to generate heatmaps. Default is TRUE. Can be slow to datasets larger than ~3000 cells.
- centroids: Use centroids as references in deconvolution instead of the default medoids.
- num_doubs: The user defined number of doublets to make for each pair of clusters. Default is 100.
- only50: use only synthetic doublets created with 50%/50% mix of parent cells, as opposed to the extended option of 30%/70% and 70%/30%, default is FALSE.
- min_uniq: minimum number of unique genes required for a cluster to be rescued, default is 4.
- data_processed = new expression file (cleaned).
- groups_processed = new groups file (cleaned).
- PMF_results = pseudo marker finder t-test results (gene by cluster).
- DRS_doublet_table = each cell and whether it is called a doublet by deconvolution analysis.
- DRS_results = results of deconvolution analysis (cell by cluster) in percentages.
- Decon_called_freq = percentage of doublets called in each cluster by deconvolution analysis.
- Final_doublets_groups = new groups file containing only doublets.
- Final_nondoublets_groups = new groups file containing only non doublets.
- Synth_doublet_info = synthetic doublet deconvolution values output for quality control.
Data for this example can be found in this GitHub repository. Examples are given for both Seurat_Pre_Process() and Improved_Seurat_Pre_Process(), though the latter is prefered if using Seurat 3.
location="/Users/xxx/xxx/" #Update as needed
<s>
#Seurat_Pre_Process()
expressionFile=paste0(location, "counts.txt")
genesFile=paste0(location, "Top50Genes.txt")
clustersFile=paste0(location, "Cluster.txt")
newFiles=Seurat_Pre_Process(expressionFile, genesFile, clustersFile)
</s>
#Improved_Seurat_Pre_Process()
seuratObject=readRDS("seurat.rds")
newFiles=Improved_Seurat_Pre_Process(seuratObject, num_genes=50, write_files=FALSE)
filename="PBMC_example"
write.table(newFiles$newExpressionFile, paste0(location, filename, "_expression"), sep="\t")
write.table(newFiles$newFullExpressionFile, paste0(location, filename, "_fullExpression"), sep="\t")
write.table(newFiles$newGroupsFile, paste0(location, filename , "_groups"), sep="\t", col.names = F)
results=Main_Doublet_Decon(rawDataFile=newFiles$newExpressionFile,
groupsFile=newFiles$newGroupsFile,
filename=filename,
location=location,
fullDataFile=NULL,
removeCC=FALSE,
species="hsa",
rhop=1.1,
write=TRUE,
PMF=TRUE,
useFull=FALSE,
heatmap=FALSE,
centroids=TRUE,
num_doubs=100,
only50=FALSE,
min_uniq=4)