This tutorial provides a step-by-step guide to analyzing a 10x sample dataset consisting of an FFPE human lung cancer tissue diagnosed with _Neuroendocrine Carcinoma_.
git clone https://github.com/Sanofi-Public/spatialone-pipeline.git
cd spatialone-pipeline
This will download all the source code needed to run the analysis.
The following script downloads all the data needed to run the analysis from Zenodo. It also verifies the file checksum to confirm data integrity, and then unzips the archive.
./download_experiment_data.sh
Alternatively, you can run the following commands to directly download and unzip the data:
curl -L -o SpatialOne_Data.tar.gz "https://zenodo.org/records/12605154/files/SpatialOne_Data.tar.gz?download=1"
tar -xzvf SpatialOne_Data.tar.gz
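If you download the archive manually, you may still want an integrity check similar to the one the script performs. The sketch below is one possible way to do it, assuming GNU `md5sum` is available and that you copy the expected digest from the Zenodo record page. `verify_md5` and `EXPECTED_MD5` are hypothetical names used for illustration; they are not part of SpatialOne.

```shell
# Hypothetical helper: succeed only if the file's MD5 digest matches the expected value.
verify_md5() {
    file="$1"
    expected="$2"
    # md5sum prints "<digest>  <filename>"; keep only the digest.
    actual="$(md5sum "$file" | cut -d ' ' -f 1)"
    [ "$actual" = "$expected" ]
}

# Example (EXPECTED_MD5 is a placeholder; copy the real value from the Zenodo record):
# verify_md5 SpatialOne_Data.tar.gz "$EXPECTED_MD5" && tar -xzvf SpatialOne_Data.tar.gz
```

Running the extraction only after the check succeeds avoids unpacking a truncated or corrupted download.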
Make sure that the sample data has been stored under the prep folder.
Edit your .env file to reflect your system setup.
- The HOST_DATA_PATH variable should reflect the path where the data downloaded in the previous step has been stored. If you are following this tutorial, this corresponds to the folder where you cloned the GitHub project. After running step 2, the HOST_DATA_PATH folder should contain the prep, conf and reference folders.
- The GPU_DEVICE_ID variable should reflect the ID of your GPU (0 if only one GPU is available in the system). This parameter is not required if the analysis is run using only the CPU.
For instance:
# Update your configurations for data and GPU
HOST_DATA_PATH=/Users/demo_user/Documents/spatialone-pipeline/ ## This path should point to the 'data' path where your experiment data is stored
GPU_DEVICE_ID=0 # select GPU ID access for docker image, 0 if you only have 1 GPU.
# Project variables, optional
REPOSITORY_NAME="spatialone-pipeline"
METAFLOW_USER="user@example.com"
METAFLOW_DEFAULT_DATASTORE="local"
PROJECT_NAME="spatialone"
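Before building the container, it can help to confirm that HOST_DATA_PATH points at a folder with the expected layout. The following is a minimal sketch under the assumption that the prep, conf and reference folders sit directly under HOST_DATA_PATH, as described above. `check_data_layout` is a hypothetical helper, not a SpatialOne command.

```shell
# Hypothetical helper: report which of the expected sub-folders are missing.
check_data_layout() {
    root="$1"
    missing=0
    for sub in prep conf reference; do
        if [ ! -d "$root/$sub" ]; then
            echo "missing folder: $root/$sub" >&2
            missing=1
        fi
    done
    # Returns 0 when all three folders exist, 1 otherwise.
    return "$missing"
}

# Example: load .env, then validate the folder it points to.
# set -a; . ./.env; set +a
# check_data_layout "$HOST_DATA_PATH" && echo "data layout looks complete"
```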
You can generate the SpatialOne docker container by executing the following command:
make build
This process may take up to 20 minutes.
Alternatively, if you are using an AMD architecture, you can pull the Docker image from Docker Hub:
docker pull albertpla/spatialone_amd:latest
docker tag albertpla/spatialone_amd:latest spatialone-pipeline:latest # This will rename the docker image to be coherent with the spatialone setup
The reference dataset used during the cell deconvolution step needs to be stored under the reference folder using the naming convention reference_name_cell_atlas.h5ad. In this tutorial we'll use the luca_cell_atlas.h5ad dataset, which was derived from the [Single-cell Lung Cancer Atlas](https://luca.icbi.at/). The reference data needs to be packaged as an AnnData file following the guidelines provided by scverse and cell2location.
The reference anndata file used in this example follows this structure:
- Raw gene expression counts from the atlas are stored in /X
- Data is gzip compressed
- Single cell barcode is "experiment", experiment/batch tag is "batch", and cell annotation is defined in "cell_type"
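The naming convention for reference files can be expressed as a small path builder. This is only a sketch: `reference_file_for` is a hypothetical helper, and it assumes the reference folder sits directly under the data root (HOST_DATA_PATH), as described earlier in this tutorial.

```shell
# Hypothetical helper: build the path where a reference atlas is expected,
# given the data root and the reference name.
reference_file_for() {
    data_root="${1%/}"   # drop one trailing slash, if any
    ref_name="$2"
    printf '%s/reference/%s_cell_atlas.h5ad\n' "$data_root" "$ref_name"
}

# Example: check that the LuCA atlas used in this tutorial is in place.
# ref="$(reference_file_for "$HOST_DATA_PATH" luca)"
# if [ -f "$ref" ]; then echo "found $ref"; else echo "missing $ref" >&2; fi
```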
SpatialOne uses YAML files to set up its configuration. Configuration files should be stored in the conf folder; visium_config_flow.yaml is the file name used by default. We recommend keeping this config file name so the analysis can be started by simply running make run.
In this analysis we will use cellpose to segment cells in the H&E image, cell2location to estimate the cell type proportions at each spot, the local assignment method to estimate the cell type of each segmented cell, and we'll run a standard downstream spatial analysis.
The first configuration block corresponds to the experiment metadata:
user.id: "user@example.com" # for internal metadata handling
run.name: "SpatialOne Tutorial Example"
run.description: "Squamous Cell Carcinoma analyzed with Cellpose & Cell2location (LuCA)"
run.summary: "SCC analysis"
experiment.ids: [
# List of experiments to analyze
# They will be analyzed concurrently using the same configuration params.
"CytAssist_FFPE_Human_Lung_Squamous_Cell_Carcinoma", # This name should match the folder name under 'prep' that contains the sample data
]
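Because each experiment ID must match a folder name under prep, a quick check before launching a long run can catch typos early. The sketch below assumes the sample folders live at `<data root>/prep/<experiment id>`; `check_experiments` is a hypothetical helper and not part of the pipeline.

```shell
# Hypothetical helper: succeed only if every experiment ID has a folder under prep/.
check_experiments() {
    root="${1%/}"   # data root, trailing slash removed
    shift           # remaining arguments are experiment IDs
    status=0
    for id in "$@"; do
        if [ ! -d "$root/prep/$id" ]; then
            echo "no sample folder for experiment: $id" >&2
            status=1
        fi
    done
    return "$status"
}

# Example:
# check_experiments "$HOST_DATA_PATH" "CytAssist_FFPE_Human_Lung_Squamous_Cell_Carcinoma"
```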
The second block corresponds to the pipelines that will be executed. As we plan to run an end-to-end analysis, all of them will be set to True:
pipelines.enabled:
  imgseg: True
  cell2spot: True
  celldeconv: True
  cluster: True
  assign: True
  qc: True
  datamerge: True
  spatialanalysis: True
For the image segmentation block, it is key to specify that we want to use cellpose and the nuclei model. The flow_threshold parameter determines how aggressive the segmentation will be; a value of 0.8 will result in most of the cells being segmented. For a detailed description of the parameters, please check here.
imgseg:
  image.resolution: "" #TBD
  image.magnification: "" #TBD
  model:
    name: "cellpose"
    version: "2.1.1"
    params:
      patch_size: 512
      overlap: 496
      downsample_factor: 1
      n_channels: 3
      channels: [1,0]
      model_type: "nuclei"
      batch_size: 100
      diameter: 16
      flow_threshold: 0.8
For the cell deconvolution module, we need to specify cell2location as the deconvolution method. It is essential that the atlas_type parameter matches the reference dataset name defined in step 5 (in this case, luca). Some of the key parameters to customize are st_max_epoches, which determines the maximum number of training epochs, and sc_use_gpu, which determines whether SpatialOne will use a GPU or a CPU for training. For a detailed description of the parameters, please check here.
celldeconv:
  model:
    name: "cell2location"
    version: "0.1.3"
    params:
      seed : 2023
      # params for cleaning sc datasets
      sc_cell_count_cutoff : 20 # int, gene filter: a gene should be detected in more than this number of cells (e.g. in over 20 cells)
      sc_cell_percentage_cutoff2 : 0.05 # float in (0,1), gene filter: a gene should be detected in more than this fraction of cells
      sc_nonz_mean_cutoff : 1.12 # float > 1, gene filter: a gene should have a mean expression across non-zero cells slightly larger than 1
      # params for training sc to get st signatures
      sc_batch_key: "batch" # single cell data batch category, e.g. 10X reaction / sample / batch
      sc_label_key: "cell_type" # cell type, covariate used for constructing signatures
      sc_categorical_covariate_keys: [] # multiplicative technical effects (platform, 3' vs 5', donor effect)
      sc_max_epoches : 50
      sc_lr : 0.02
      sc_use_gpu : True
      # params for training st to get cell abundance
      st_N_cells_per_location : 20
      st_detection_alpha : 200
      st_max_epoches : 25000
      cell_aboundance_threshold : 0.1 # cell abundance threshold: abundances below it are set to 0. By default, cell abundance > 0.1 is used as the gold standard label
      atlas_type : 'luca'
      mode : ### model
      retrain_cell_sig: True ## whether to retrain the signature (True) or reuse an existing trained signature (False)
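Since atlas_type has to match the name of a file in the reference folder, a small consistency check can be useful before starting a long training run. The sketch below reads the value with a plain-text match rather than a real YAML parser, so it assumes atlas_type appears once, on a single line. `atlas_type_from_config` is a hypothetical helper, not part of SpatialOne.

```shell
# Hypothetical helper: print the first atlas_type value found in a config file.
# Handles unquoted, single-quoted and double-quoted values, and ignores trailing comments.
atlas_type_from_config() {
    sed -n "s/^[[:space:]]*atlas_type[[:space:]]*:[[:space:]]*['\"]\{0,1\}\([^'\"#[:space:]]*\).*/\1/p" "$1" | head -n 1
}

# Example: confirm that the matching reference file exists.
# atlas="$(atlas_type_from_config "$HOST_DATA_PATH/conf/visium_config_flow.yaml")"
# [ -f "$HOST_DATA_PATH/reference/${atlas}_cell_atlas.h5ad" ] || echo "no reference file for atlas_type '$atlas'" >&2
```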
For the morphological cell clustering block, it is key to define the number of expected cell types (e.g. 20). To allow the local cell type estimation method to infer the cell types of the segmented cells, it is important to enable the spot_clustering step, as it clusters the morphological features both across the whole slide and within each spot.
cluster:
  model:
    name: "gaussian"
    version: "X.X.X"
    params:
      n_clusters: 20
      spot_clustering: True
assign:
  model:
    name: "local"
    version: "X.X.X"
The cell2spot, QC, and spatialanalysis modules do not require any specific configuration:
cell2spot:
  model:
    name: "default"
    version: "X.X.X"
qc:
  model:
    name: "default"
    version: "X.X.X"
spatialanalysis:
  model:
    name: "default"
    version: "X.X.X"
Finally, in the datamerge step the user can define whether geojson annotations are available, as well as the number of top genes that will be considered in the downstream analysis.
datamerge:
  run_data_merge_only: True
  model:
    name: "default"
    version: "X.X.X"
    params:
      annotation_file: "annotations.geojson"
      spot_index: "barcode"
      cell_index: "cell_ids"
      target_genes: []
      n_top_expressed_genes: 750 # Forces the top 750 most expressed genes to be included in the reporting
      n_top_variability_genes: 750
If you have kept the default config file name "visium_config_flow.yaml" (recommended), the analysis can be triggered simply by running the following command, which will run the SpatialOne light version:
make run
or
make run-cpu
If the file name has been changed, or you want to run the analysis using a customized setup, the analysis can be triggered by running:
set -a
source .env
set +a
docker run --gpus device=${GPU_DEVICE_ID} -it -v ${HOST_DATA_PATH}:/app/data spatialone-pipeline
Alternatively, you can run the analysis directly, without loading the .env environment file, by replacing GPU_DEVICE_ID and HOST_DATA_PATH with their values:
docker run --gpus device=0 -it -v /Users/demo_user/Documents/spatialone-pipeline/:/app/data spatialone-pipeline
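When substituting values by hand, it is easy to mistype the volume argument, which Docker expects in the form `host_path:container_path`. The sketch below only prints the command so it can be inspected before being run; `spatialone_run_command` is a hypothetical helper, and the image name and the /app/data mount point are taken from the commands in this tutorial.

```shell
# Hypothetical helper: print the docker command for a given GPU id and host data path.
# It fails if either argument is missing or empty.
spatialone_run_command() {
    gpu_id="${1:?GPU device id is required}"
    data_path="${2:?host data path is required}"
    printf 'docker run --gpus device=%s -it -v %s:/app/data spatialone-pipeline\n' "$gpu_id" "$data_path"
}

# Example: print the command, review it, then paste it into the terminal.
# spatialone_run_command 0 /Users/demo_user/Documents/spatialone-pipeline/
```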