Example Workflow: CELLxGENE
Source: vignettes/example_workflow.Rmd
Introduction
This vignette demonstrates a basic workflow for accessing and analysing single-cell RNA-seq data from the CELLxGENE repository using {laminr}. CZ CELLxGENE Discover is a standardised collection of scRNA-seq datasets, and LaminDB makes it easy to query and access data in this repository. We will go through the steps of finding and downloading a dataset using {laminr}, performing some simple analysis using {Seurat}, and saving the results to your own LaminDB database.
Before we start
Before we begin, please take some time to check out the Getting Started vignette (vignette("laminr", package = "laminr")). In particular, make sure you have run the commands in the “Initial Setup” section.
Once that is done, we can load the {laminr} library.
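The load step can be sketched as follows. This is a minimal example that assumes {laminr} may or may not be installed yet; the `laminr_available` variable is a hypothetical helper introduced only for illustration.

```r
# Check whether {laminr} is installed before attaching it
# (laminr_available is a hypothetical helper name, not part of the package)
laminr_available <- requireNamespace("laminr", quietly = TRUE)

if (laminr_available) {
  # Attach the package so that its functions can be called directly
  library(laminr)
} else {
  # Point back to the setup steps rather than failing with an error
  message("{laminr} is not installed; see the Getting Started vignette.")
}
```

In an interactive session where the setup is already complete, a plain `library(laminr)` call is enough.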
Connecting to LaminDB
The first thing we need to do is connect to the LaminDB database. For this tutorial, we will connect to a default instance (where we will store results) and the CELLxGENE instance that we will search for datasets.
Connect to the default instance
We will start by connecting to your default LaminDB instance. You can set the default instance using the lamin CLI on the command line:
lamin connect <owner>/<name>
Once a default instance has been set, we can connect to it with {laminr}:
db <- connect()
#> ! schema module 'bionty' is not installed → no access to its labels & registries (resolve via `pip install bionty`)
#> → connected lamindb: laminlabs/cellxgene
db
#> cellxgene
#> Core registries
#> $Run
#> $User
#> $Param
#> $ULabel
#> $Feature
#> $Storage
#> $Artifact
#> $Transform
#> $Collection
#> $FeatureSet
#> $ParamValue
#> $FeatureValue
#> Additional modules
#> bionty
This gives us an object we can use to interact with the database.
Note that only the default instance can create new records. This tutorial assumes you have access to an instance where you have permission to add data.
Track data provenance
Before we start, we will track the code that is run in this notebook.
db$track("I8BlHXFXqZOG0000", path = "example_workflow.Rmd")
Tip: The ID should be obtained by running db$track(path = "example_workflow.Rmd") and copying the ID from the output.
Connect to the CELLxGENE instance
We can connect to other instances by providing a slug to the connect() function. Instances connected to in this way can be used to query data but not to make changes. Let’s connect to the CELLxGENE instance:
cellxgene <- connect("laminlabs/cellxgene")
cellxgene
#> cellxgene
#> Core registries
#> $Run
#> $User
#> $Param
#> $ULabel
#> $Feature
#> $Storage
#> $Artifact
#> $Transform
#> $Collection
#> $FeatureSet
#> $ParamValue
#> $FeatureValue
#> Additional modules
#> bionty
Downloading a dataset
In Lamin, artifacts are objects that contain information (single-cell data, images, data frames etc.) as well as associated metadata. You can see what artifacts are available using the database instance object.
cellxgene$Artifact$df(limit = 5)
#>     id suffix X_accessor n_objects visibility
#> 1 2846        tiledbsoma       290          1
#> 2 3665        tiledbsoma       330          1
#> 3 1270  .h5ad    AnnData        NA          1
#> 4 2840 .ipynb       <NA>        NA          0
#> 5 2842  .html       <NA>        NA          0
#>                                                                      key
#> 1                                            cell-census/2023-12-15/soma
#> 2                                            cell-census/2024-07-01/soma
#> 3 cell-census/2023-07-25/h5ads/7a0a8891-9a22-4549-a55b-c2aca23c3a2a.h5ad
#> 4                                                                   <NA>
#> 5                                                                   <NA>
#>                    uid         size                   hash
#> 1 FYMewVq5twKMDXVy0000 635848093433 Mfyw8VuqftX5REITfQH_yg
#> 2 FYMewVq5twKMDXVy0001 870700998221 bzrXBPNvitSVKvb3GG38_w
#> 3 tczTlSHFPOcAcBnfyxKA   1297573950 UlsVvBz9kMzn2r9RdoAAOg
#> 4 JIIPyQX5l9qELPl42d75        36297 gNdUkonYgQJP_Mi3xLzt_g
#> 5 Whyxwf3k2GjJwTPCl1FK       716529 BDGZac3qU3oLVFpO035Qhg
#>                            description n_observations is_latest X_hash_type
#> 1                    Census 2023-12-15       68683222     FALSE       md5-d
#> 2                    Census 2024-07-01      115556140      TRUE       md5-d
#> 3      Supercluster: Hippocampal CA1-3          74979     FALSE       md5-n
#> 4 Source of transform G69jtgzKO0eJ6K79             NA     FALSE         md5
#> 5   Report of run UAAiLAi0BrLvlKnsuvP3             NA     FALSE         md5
#>      type                       created_at X_key_is_virtual
#> 1 dataset 2024-07-12T12:12:16.091881+00:00            FALSE
#> 2 dataset 2024-07-16T12:52:01.424629+00:00            FALSE
#> 3    <NA> 2023-11-28T21:46:12.685907+00:00            FALSE
#> 4    <NA> 2024-01-29T08:32:13.311741+00:00             TRUE
#> 5    <NA> 2024-01-29T08:32:18.346499+00:00             TRUE
#>                         updated_at    version
#> 1 2024-09-17T13:00:13.714256+00:00 2023-12-15
#> 2 2024-09-17T13:01:23.739635+00:00 2024-07-01
#> 3 2024-01-24T07:10:21.725547+00:00 2023-07-25
#> 4 2024-01-29T08:32:13.311792+00:00          0
#> 5 2024-01-30T09:12:06.027928+00:00          1
This is useful, but it’s not the nicest or easiest way to find a particular dataset. Instead, we will use the Lamin Hub website to find the data we want to load.
- Open a browser and go to https://lamin.ai/laminlabs/cellxgene
- On the top toolbar, click the “Artifacts” tab
- Use the search field and the filters to find a dataset you are interested in.
  - We use the “Suffix” filter to find .h5ad files and search for “renal cell carcinoma”
- Select the entry for the dataset you want to load to open a page with more details
- Click the copy button at the top right; this copies a command including the ID for the artifact
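For comparison, the “Suffix” filter and text search can be roughly mirrored in R on a data frame shaped like the one returned by df(). The sketch below is only an illustration: it uses a toy data frame built from three rows and three columns of the output shown earlier, and the search term “hippocampal” is chosen because it matches one of those rows. The names `artifacts`, `h5ad_artifacts` and `hits` are hypothetical.

```r
# Toy data frame with three rows copied from the df() output shown earlier
# (only a few columns are kept; the object names here are hypothetical)
artifacts <- data.frame(
  id = c(1270, 2840, 2842),
  suffix = c(".h5ad", ".ipynb", ".html"),
  description = c(
    "Supercluster: Hippocampal CA1-3",
    "Source of transform G69jtgzKO0eJ6K79",
    "Report of run UAAiLAi0BrLvlKnsuvP3"
  ),
  stringsAsFactors = FALSE
)

# Keep only .h5ad files, similar to the "Suffix" filter on the website
h5ad_artifacts <- artifacts[artifacts$suffix == ".h5ad", ]

# Case-insensitive text match on the description, similar to the search field
hits <- h5ad_artifacts[
  grepl("hippocampal", h5ad_artifacts$description, ignore.case = TRUE),
]

hits
```

Because df(limit = 5) only returns a handful of records, this kind of local filtering is mainly useful for small result sets; the website remains the easier way to browse the full registry.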
Once we have the artifact ID, we can load information about the artifact, similar to what we see on the website. Notice that we use a slightly different command to what we copied from the website.
artifact <- cellxgene$Artifact$get("7dVluLROpalzEh8mNyxk")
artifact
#> Artifact(uid='7dVluLROpalzEh8mNyxk', description='Renal cell carcinoma, pre aPD1, kidney Puck_200727_12', key='cell-census/2023-12-15/h5ads/02faf712-92d4-4589-bec7-13105059cf86.h5ad', id=1742, run_id=22, hash='YNYuokfAoDFxdaRILjmU9w', size=13997860, suffix='.h5ad', storage_id=2, version='2023-12-15', _accessor='AnnData', is_latest=TRUE, transform_id=16, _hash_type='md5-n', created_at='2024-01-11T09:13:23.143694+00:00', created_by_id=1, updated_at='2024-01-24T07:17:47.009288+00:00', visibility=1, n_observations=17612, _key_is_virtual=FALSE)
So far we have only retrieved the metadata about this object. To download the data itself we need to run another command.
adata <- artifact$load()
#> |======================================================================| 100%
adata
#> AnnData object with n_obs × n_vars = 17612 × 23254
#>     obs: 'n_genes', 'n_UMIs', 'log10_n_UMIs', 'log10_n_genes', 'Cell_Type', 'cell_type_ontology_term_id', 'organism_ontology_term_id', 'tissue_ontology_term_id', 'assay_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'donor_id', 'is_primary_data', 'suspension_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage'
#>     var: 'gene', 'n_beads', 'n_UMIs', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype'
#>     uns: 'Cell_Type_colors', 'schema_version', 'title'
#>     obsm: 'X_spatial'
This dataset has been stored as an AnnData object. In the next sections we will convert it to a Seurat object and perform some simple analysis.
Convert to Seurat
There are various approaches for converting between different single-cell objects, some of which are described in the Interoperability chapter of the Single-cell Best Practices book.
Because we already have the data loaded in memory, the simplest option is to extract the information we need and create a new Seurat object.
seurat <- SeuratObject::CreateSeuratObject(
  counts = Matrix::t(adata$X),
  meta.data = adata$obs
)
#> Warning: Data is of class dgRMatrix. Coercing to dgCMatrix.
seurat
#> An object of class Seurat
#> 23254 features across 17612 samples within 1 assay
#> Active assay: RNA (23254 features, 0 variable features)
#> 1 layer present: counts
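The transpose in the call above is needed because the AnnData matrix is stored as cells × genes (17612 × 23254), while the resulting Seurat object reports features × samples (23254 × 17612). The sketch below illustrates the same orientation change on a toy sparse matrix. The matrix contents and the names `cells_by_genes` and `genes_by_cells` are made up for illustration and do not come from the dataset.

```r
library(Matrix)

# Toy cells x genes count matrix (3 cells, 2 genes), laid out like adata$X
# (the values and names are made up for illustration)
cells_by_genes <- Matrix::sparseMatrix(
  i = c(1, 2, 3, 3),
  j = c(1, 1, 2, 1),
  x = c(5, 2, 7, 1),
  dims = c(3, 2),
  dimnames = list(c("cell1", "cell2", "cell3"), c("geneA", "geneB"))
)

# Transpose to genes x cells, the orientation passed as the counts argument
genes_by_cells <- Matrix::t(cells_by_genes)

dim(cells_by_genes)
dim(genes_by_cells)
```

After the transpose, row names are gene identifiers and column names are cell identifiers, which is also how the rows of the metadata data frame are matched to cells.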
Analysis
We could perform any normal analysis using {Seurat}, but as an example we will calculate marker genes for each of the annotated cell types. To make things a bit quicker we only test the first 1000 genes, but if you have a few minutes you can get results for all features.
# Set cell identities to the provided cell type annotation
SeuratObject::Idents(seurat) <- "Cell_Type"
# Normalise the data
seurat <- Seurat::NormalizeData(seurat)
#> Normalizing layer: counts
# Test for marker genes
markers <- Seurat::FindAllMarkers(
  seurat,
  features = SeuratObject::Features(seurat)[1:1000]
)
#> Calculating cluster Epithelial
#> Calculating cluster Fibroblast
#> For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
#> (default method for FindMarkers) please install the presto package
#> --------------------------------------------
#> install.packages('devtools')
#> devtools::install_github('immunogenomics/presto')
#> --------------------------------------------
#> After installation of presto, Seurat will automatically use the more
#> efficient implementation (no further action necessary).
#> This message will be shown once per session
#> Calculating cluster Myeloid
#> Calculating cluster Tumor
#> Warning: The following tests were not performed:
#> Warning: When testing Epithelial versus all:
#> Cell group 1 has fewer than 3 cells
# The output is a data.frame
head(markers)
#>                        p_val avg_log2FC pct.1 pct.2    p_val_adj    cluster
#> ENSG00000164283 1.030703e-89  2.7485040 0.205 0.048 2.396797e-85 Fibroblast
#> ENSG00000116016 3.606838e-38  2.0721038 0.152 0.051 8.387340e-34 Fibroblast
#> ENSG00000074800 5.097282e-25 -0.9810317 0.185 0.366 1.185322e-20 Fibroblast
#> ENSG00000112715 6.663398e-18 -1.1826785 0.078 0.202 1.549507e-13 Fibroblast
#> ENSG00000140416 1.844156e-17 -0.6994000 0.175 0.326 4.288400e-13 Fibroblast
#> ENSG00000125810 8.916133e-15  1.8102270 0.057 0.019 2.073358e-10 Fibroblast
#>                            gene
#> ENSG00000164283 ENSG00000164283
#> ENSG00000116016 ENSG00000116016
#> ENSG00000074800 ENSG00000074800
#> ENSG00000112715 ENSG00000112715
#> ENSG00000140416 ENSG00000140416
#> ENSG00000125810 ENSG00000125810
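Since the marker results are an ordinary data.frame, they can be filtered with base R. The sketch below rebuilds the six rows printed above (a subset of columns only) and keeps up-regulated genes. The thresholds `max_p_adj` and `min_log2fc` are hypothetical values chosen for illustration and are not recommendations from the vignette.

```r
# First six rows of the marker table, copied from the head(markers) output
# (only a subset of the columns is reproduced here)
markers_head <- data.frame(
  gene = c(
    "ENSG00000164283", "ENSG00000116016", "ENSG00000074800",
    "ENSG00000112715", "ENSG00000140416", "ENSG00000125810"
  ),
  avg_log2FC = c(
    2.7485040, 2.0721038, -0.9810317,
    -1.1826785, -0.6994000, 1.8102270
  ),
  p_val_adj = c(
    2.396797e-85, 8.387340e-34, 1.185322e-20,
    1.549507e-13, 4.288400e-13, 2.073358e-10
  ),
  cluster = "Fibroblast",
  stringsAsFactors = FALSE
)

# Hypothetical thresholds, chosen only for illustration
max_p_adj <- 0.05
min_log2fc <- 1

# Keep significant genes that are higher in the cluster than in other cells
up_markers <- markers_head[
  markers_head$p_val_adj < max_p_adj & markers_head$avg_log2FC > min_log2fc,
]

# Order by effect size, largest first
up_markers <- up_markers[order(-up_markers$avg_log2FC), ]

up_markers$gene
```

The same subsetting applies unchanged to the full `markers` object, where the `cluster` column can additionally be used to split the results by cell type.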
Store the results in LaminDB
Now that we have our results, we can save them to the LaminDB instance.
Render and upload the notebook
You can render this notebook to HTML:
- In RStudio, click the “Knit” button
- From the command line, run:
- Or use the rmarkdown package in R:
  rmarkdown::render("example_workflow.Rmd")
And then save it to your LaminDB instance using the lamin CLI: