DXMarkers: A tool for rapidly extracting markers from the "CellMarker", "CellMarker2.0", and "PanglaoDB" databases
DXMarkers is an R package that aims to make manual cell type annotation in single-cell RNA sequencing (scRNA-seq) analysis more efficient. It works with genes that are highly expressed in particular cell types, known as "markers," and retrieves them from three widely used cell marker databases: "CellMarker," "CellMarker2.0," and "PanglaoDB." DXMarkers can also filter search results by species and tissue type, so that users obtain more relevant matches.
DXMarkers supports various installation methods, including GitHub, Gitee, and local installation.
Before installing the DXMarkers package from GitHub, make sure that the devtools package has been installed. If not yet installed, you can install it using install.packages("devtools"). After that, install DXMarkers by executing the following command:
library(devtools)
install_github("DaXuanGarden/DXMarkers")
To install the DXMarkers package from Gitee, you first need to install the remotes package, which can be installed through install.packages("remotes"). Then, install from Gitee using the following command:
remotes::install_git('https://gitee.com/DaXuanGarden/dxmarkers')
For local installation, you need to specify the local path of the R package file. First, set the working directory, then load the devtools package and install:
setwd("/home/data/t050446/01 Single Cell Project/Acute Pancreatitis")
library(devtools)
install_local("/home/data/t050446/Non-tumor research/DXMarkers_1.0.tar.gz")
If you wish to obtain the R package file for local installation, you can follow the official account "大轩的成长花园" (Garden of Da Xuan) and leave a message mentioning "DXMarkers." Da Xuan will promptly send you the packaged installation file. By following the provided instructions, you can quickly install DXMarkers and use it for batch querying of markers, assisting you in your single-cell analysis.
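The "install it if not yet installed" check from the steps above can also be scripted. The following is a minimal sketch, assuming a hypothetical helper named `ensure_package` (it is not part of DXMarkers, devtools, or remotes); it calls the installer only when the package is missing.

```r
# Hypothetical helper (not part of DXMarkers): install a package only if missing.
# `installer` defaults to install.packages; it is a parameter so the logic
# can be exercised without touching the network.
ensure_package <- function(pkg, installer = utils::install.packages) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))  # already available, nothing installed
  }
  installer(pkg)
  invisible(TRUE)             # an installation was attempted
}

# Usage (commented out so that sourcing this sketch installs nothing):
# ensure_package("devtools")
# devtools::install_github("DaXuanGarden/DXMarkers")
```

The helper returns `TRUE` when it attempted an installation and `FALSE` when the package was already present, which makes it easy to log what a setup script actually did.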
Before we use the R package for Marker retrieval, we first need to process and analyze single-cell data.
- Load raw data: This step is to load the raw gene expression matrix into the R environment, where rows typically represent genes and columns represent cells.
- Create a Seurat object: Seurat is an R package for single-cell RNA-seq data analysis. We first need to create a Seurat object to store and manipulate the data.
- Data preprocessing and quality control: In this step, we will remove cells with too few or too many expressed genes, as well as cells with a high proportion of mitochondrial genes.
- Data normalization: This step normalizes the gene expression level of each cell by its total expression, to eliminate technical variations between cells.
- Find highly variable genes: Highly variable genes are genes that show large differences in expression levels among different cells. These genes can be used to distinguish different cell types.
- Data scaling and linear dimension reduction: This step is to reduce the dimensionality of the data for subsequent clustering analysis.
- Cell clustering: Group cells with similar gene expression patterns together.
- Non-linear dimension reduction: Non-linear dimension reduction techniques such as t-SNE or UMAP can help us visualize high-dimensional single-cell data on a two-dimensional plane.
- Plotting clustering results: Plot each cell on a two-dimensional plane according to its clustering results, to observe the distribution of different cell populations.
- Finding marker genes: Marker genes are used to identify specific cell populations, and their expression levels are significantly higher in that cell population than in other cell populations.
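The steps listed above, from creating the Seurat object through clustering and t-SNE, can be sketched as a single function. This is a minimal sketch of the standard Seurat workflow rather than the author's exact script: the function name `run_seurat_workflow`, the QC thresholds, the number of principal components, and the clustering resolution are illustrative assumptions that should be tuned for each dataset.

```r
# Sketch of a standard Seurat preprocessing workflow (assumed parameters).
# `counts` is a genes x cells count matrix; `mt_pattern` matches mitochondrial
# gene names ("^mt-" for mouse, "^MT-" for human).
run_seurat_workflow <- function(counts, mt_pattern = "^mt-", dims = 1:10,
                                resolution = 0.5) {
  if (!requireNamespace("Seurat", quietly = TRUE)) {
    stop("Please install Seurat first: install.packages(\"Seurat\")")
  }
  # Create the Seurat object, dropping very sparse genes and cells
  obj <- Seurat::CreateSeuratObject(counts = counts, project = "scRNA",
                                    min.cells = 3, min.features = 200)
  # Quality control: gene-count limits and mitochondrial percentage
  # (thresholds are illustrative, taken from common tutorial defaults)
  obj[["percent.mt"]] <- Seurat::PercentageFeatureSet(obj, pattern = mt_pattern)
  obj <- subset(obj, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 &
                  percent.mt < 5)
  # Normalization and highly variable genes
  obj <- Seurat::NormalizeData(obj)
  obj <- Seurat::FindVariableFeatures(obj, selection.method = "vst",
                                      nfeatures = 2000)
  # Scaling and linear dimension reduction (PCA)
  obj <- Seurat::ScaleData(obj)
  obj <- Seurat::RunPCA(obj, features = Seurat::VariableFeatures(obj))
  # Clustering on the first principal components
  obj <- Seurat::FindNeighbors(obj, dims = dims)
  obj <- Seurat::FindClusters(obj, resolution = resolution)
  # Non-linear dimension reduction for plotting
  obj <- Seurat::RunTSNE(obj, dims = dims)
  obj
}

# Usage (commented out; the data directory is a placeholder, not a real path):
# counts <- Seurat::Read10X(data.dir = "path/to/10x/output")
# scRNA <- run_seurat_workflow(counts)
# saveRDS(scRNA, file = "scRNA.rds")
# Seurat::DimPlot(scRNA, reduction = "tsne", label = TRUE)
```

Saving the result as `scRNA.rds` matches the file name that the marker-finding code in this README reads back in.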
The following sample code demonstrates how to find marker genes in processed single-cell data using the Seurat library.
# Manual annotation
# Load Seurat and dplyr (dplyr provides %>%, group_by, and top_n used below)
library(Seurat)
library(dplyr)

# Load the processed Seurat object
scRNA <- readRDS(file = "scRNA.rds")

# Find markers for all clusters
scRNA.markers <- FindAllMarkers(scRNA, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

# Keep only genes with p_val < 0.05
all.markers <- scRNA.markers %>%
  dplyr::select(gene, everything()) %>%
  dplyr::filter(p_val < 0.05)

# Select the top 10 genes per cluster by avg_log2FC
top10 <- all.markers %>%
  group_by(cluster) %>%
  top_n(n = 10, wt = avg_log2FC)

# Write out all genes with p_val < 0.05
write.csv(all.markers, file = "all_markers.csv", row.names = FALSE)

# Write out the top 10 genes per cluster
write.csv(top10, file = "top10_genes.csv", row.names = FALSE)

# Draw a heatmap of the top 10 marker genes per cluster
DoHeatmap(scRNA, features = top10$gene)

# Draw violin plots for the first 20 genes in the top-10 list
VlnPlot(scRNA, features = top10$gene[1:20], pt.size = 0)

# Draw the dimensionality reduction plot (requires a t-SNE reduction in the object)
DimPlot(scRNA, label = TRUE, reduction = "tsne")
After the data is read and processed as described above, you will obtain a list of marker genes for each cluster (saved as top10_genes.csv). This list serves as the initial input file for DXMarkers.
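To make the shape of this input concrete, here is a small base R sketch on made-up numbers. The column names follow the table returned by Seurat's `FindAllMarkers` (`p_val`, `avg_log2FC`, `pct.1`, `pct.2`, `p_val_adj`, `cluster`, `gene`); all values are invented for illustration, and the helper `top_n_per_cluster` is a hypothetical stand-in for the dplyr filtering step.

```r
# Toy marker table with the same columns as FindAllMarkers() output.
# All values are invented for illustration only.
markers <- data.frame(
  p_val      = c(0.001, 0.020, 0.300, 0.004, 0.010),
  avg_log2FC = c(2.5, 1.2, 0.9, 3.1, 0.8),
  pct.1      = c(0.90, 0.70, 0.40, 0.95, 0.50),
  pct.2      = c(0.10, 0.20, 0.30, 0.05, 0.25),
  p_val_adj  = c(0.01, 0.20, 1.00, 0.04, 0.10),
  cluster    = c("0", "0", "0", "1", "1"),
  gene       = c("Cpa1", "Prss2", "Actb", "Ins1", "Ins2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep significant genes, then take the top n per
# cluster by avg_log2FC (unlike dplyr::top_n, ties are not kept).
top_n_per_cluster <- function(df, n = 10, alpha = 0.05) {
  df <- df[df$p_val < alpha, , drop = FALSE]
  pieces <- lapply(split(df, df$cluster), function(d) {
    head(d[order(-d$avg_log2FC), , drop = FALSE], n)
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

top1 <- top_n_per_cluster(markers, n = 1)
print(top1[, c("cluster", "gene", "avg_log2FC")])
```

The result keeps one row per cluster and gene, which is the long format that is written to top10_genes.csv.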
Process data using the reshape_genes_wide function in DXMarkers, then save the result to a file:
library(DXMarkers)
top10_data <- read.csv("top10_genes.csv")
sorted_result <- reshape_genes_wide(top10_data)
write.csv(sorted_result, file = "top10_sorted_genes.csv", row.names = FALSE)
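The internals of `reshape_genes_wide` are not shown in this README. Going by its name, one plausible reading is a long-to-wide reshape with one column per cluster. The sketch below illustrates that idea in base R under that assumption; the function name `genes_long_to_wide` and the output layout are hypothetical and are not DXMarkers' actual implementation.

```r
# Hypothetical illustration only: NOT the DXMarkers implementation.
# Turns a long table (one row per cluster/gene pair) into a wide one
# (one column per cluster), padding shorter columns with NA.
genes_long_to_wide <- function(df) {
  by_cluster <- split(df$gene, df$cluster)
  n_rows <- max(lengths(by_cluster))
  padded <- lapply(by_cluster, function(g) {
    c(g, rep(NA_character_, n_rows - length(g)))
  })
  wide <- as.data.frame(padded, stringsAsFactors = FALSE)
  names(wide) <- paste0("cluster_", names(by_cluster))
  wide
}

# Invented example input in the long format
long <- data.frame(
  cluster = c(0, 0, 1),
  gene    = c("Cpa1", "Prss2", "Ins1"),
  stringsAsFactors = FALSE
)
wide <- genes_long_to_wide(long)
print(wide)
```

A wide layout like this puts all genes of a cluster side by side, which is convenient when each cluster is looked up as a group.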
After the data processing, use the annotate_markers function in DXMarkers for one-click marker retrieval:
library(DXMarkers)
top10_genes_data <- read.csv("top10_sorted_genes.csv", stringsAsFactors = FALSE)
DXMarkers_result <- annotate_markers(top10_genes_data, "Mouse", "Pancreas")
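Conceptually, this step looks up each marker gene in a reference table restricted to the chosen species and tissue. The sketch below shows that kind of lookup on an invented mini database. The column names (`species`, `tissue`, `cell_type`, `marker`) and the helper `lookup_markers` are assumptions for illustration; they are not the DXMarkers API or the real database schema.

```r
# Invented mini reference table; the real databases are far larger and
# their column names may differ.
toy_db <- data.frame(
  species   = c("Mouse", "Mouse", "Mouse", "Human"),
  tissue    = c("Pancreas", "Pancreas", "Liver", "Pancreas"),
  cell_type = c("Acinar cell", "Beta cell", "Hepatocyte", "Beta cell"),
  marker    = c("Cpa1", "Ins1", "Alb", "INS"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: filter by species and tissue, then match genes
lookup_markers <- function(genes, db, species, tissue) {
  keep <- db$species == species & db$tissue == tissue & db$marker %in% genes
  hits <- db[keep, c("marker", "cell_type"), drop = FALSE]
  rownames(hits) <- NULL
  hits
}

hits <- lookup_markers(c("Cpa1", "Ins1", "Alb"), toy_db, "Mouse", "Pancreas")
print(hits)
```

In this toy case `Alb` is dropped because its only entry belongs to a different tissue, which shows how the species and tissue filters narrow the candidate cell types.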
With the view_data_source function in DXMarkers, you can view the complete dataset of the built-in "CellMarker", "CellMarker2.0", or "PanglaoDB" database:
library(DXMarkers)
data_source <- "CellMarker"
data <- view_data_source(data_source = data_source)
library(DXMarkers)
data_source <- "CellMarker2.0"
data <- view_data_source(data_source = data_source)
library(DXMarkers)
data_source <- "PanglaoDB"
data <- view_data_source(data_source = data_source)
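The three calls above can also be written as one loop. The following is a small sketch, assuming that `view_data_source` returns each source as a data frame; the wrapper `load_marker_sources` is a hypothetical convenience function, not part of DXMarkers.

```r
# The three built-in sources named in this README
sources <- c("CellMarker", "CellMarker2.0", "PanglaoDB")

# Hypothetical wrapper: load every requested source into a named list.
# Returns NULL (with a message) when DXMarkers is not installed.
load_marker_sources <- function(sources) {
  if (!requireNamespace("DXMarkers", quietly = TRUE)) {
    message("DXMarkers is not installed; nothing loaded.")
    return(invisible(NULL))
  }
  dbs <- lapply(sources, function(s) {
    DXMarkers::view_data_source(data_source = s)
  })
  names(dbs) <- sources
  dbs
}

dbs <- load_marker_sources(sources)
if (!is.null(dbs)) {
  # Assumes each source is returned as a data frame
  print(sapply(dbs, nrow))
}
```

Keeping the sources in a named list makes it straightforward to compare their sizes or inspect one of them with `head(dbs[["PanglaoDB"]])`.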
DXMarkers is an R package whose primary goal is to provide efficient tools for manual cell type annotation in single-cell RNA sequencing (scRNA-seq) analysis. It aims to help users swiftly and accurately retrieve marker information from several databases. Whether you are a newcomer to single-cell data analysis or an experienced researcher, DXMarkers can save you time and improve your efficiency.
If you encounter any challenges during any of the aforementioned processes or have valuable suggestions, such as the desire to include additional datasets from single-cell annotation databases, you are welcome to leave a message on the WeChat official account "Garden of Da Xuan 🐾." Da Xuan 🐾 will promptly respond to your message! Alternatively, you can also reach out to Da Xuan via email at [email protected], enabling us to share knowledge, exchange insights, and grow together.
👀 For those interested in adding more datasets from single-cell annotation databases, kindly send an email or leave a message on the official account "Garden of Da Xuan 🐾" in the format "Database Name + Website + Screenshot of the Download Button." This ensures that the relevant database supports the download of all annotation data. Da Xuan 🐾 warmly welcomes everyone to provide constructive suggestions and contribute additional datasets, further enhancing the comprehensiveness and reliability of DXMarkers. Hehe! We look forward to your messages! 😀 Keep up the good work!
Hehe! You can begin by using automated annotation software such as scCATCH and SingleR to gain a general understanding of the possible cell types. Then, use DXMarkers for manual annotation! 🎉
For those seeking more assistance with SingleR and scCATCH, feel free to reach out and leave a message for Da Xuan 🐾. Of course, Da Xuan is also excited to learn about better automated annotation tools from you! We await your contributions!
Below are the results of Da Xuan using SingleR and scCATCH for automated annotation, which you may find helpful:
Hey! It all started when I was manually annotating cells and realized that searching databases one by one and copying and pasting information into Excel by hand was too time-consuming. After asking several research groups for advice and searching WeChat official accounts, Google, Bing, Baidu, and other search engines, I could not find a tool for quick batch querying like this. So, I decided to write functions and create an R package to enable fast batch querying of the cell types associated with markers.
Please note that, similar to many existing automated annotation tools, DXMarkers cannot fully replace the actions of manually querying databases and studying marker biology through scientific literature. Its purpose is to serve as an auxiliary tool to boost cell annotation efficiency.
This is my first time creating an R package, and there is certainly room for improvement. I sincerely appreciate everyone's understanding and invite you to offer suggestions that will help Da Xuan grow quickly!
- OpenAI. (2021). ChatGPT (Version GPT-3.5 architecture). Retrieved from https://openai.com/chatgpt/.
- CellMarker: [http://xteam.xbio.top/CellMarker/](http://xteam.xbio.top/CellMarker/)
- CellMarker2.0: [http://117.50.127.228/CellMarker/](http://117.50.127.228/CellMarker/)
- PanglaoDB: [https://panglaodb.se/index.html](https://panglaodb.se/index.html)
Admittedly, DXMarkers' functionality is not yet very robust, and the function code is quite simple (even so, I spent two days and nights debugging errors just to package it as an R package, sobbing 😭). Still, there is room to develop more powerful and practical features, such as incorporating **AI algorithms** to reduce the subjectivity of manual judgments and, eventually, to provide automated cell type annotation.
In this regard, I extend an invitation to you: if you have any relevant suggestions or techniques, I welcome further discussion! Hehe! I'm still relatively inexperienced, but I am hopeful that these goals can be achieved.