link EBI data retrieval into the filter/plot/explore #4411

Merged
Changes from 17 commits
32 commits
6474bbe
link EBI data retrieval into the filter/plot/explore
wee-snufkin Oct 12, 2023
6c456d0
Merge branch 'galaxyproject:main' into filter-plot-explore-EBI-input
wee-snufkin Oct 13, 2023
2e65cea
remove EBI import bit
wee-snufkin Oct 13, 2023
4e96b32
make EBI retrieval into its own tutorial
wee-snufkin Oct 13, 2023
57f6c19
tutorial almost ready
wee-snufkin Oct 13, 2023
a4dcb64
add agenda
wee-snufkin Oct 13, 2023
be45ac6
correction
wee-snufkin Oct 13, 2023
afc2340
Merge branch 'galaxyproject:main' into filter-plot-explore-EBI-input
wee-snufkin Oct 21, 2023
7c1d939
fix link
wee-snufkin Oct 21, 2023
2a7950c
fix links
wee-snufkin Oct 21, 2023
76b212a
image for EBI retrieval tutorial
wee-snufkin Oct 21, 2023
ed58e6e
Create index.md
wee-snufkin Oct 21, 2023
88372d5
workflow tests
wee-snufkin Oct 21, 2023
b482308
Merge branch 'galaxyproject:main' into filter-plot-explore-EBI-input
wee-snufkin Nov 3, 2023
cbb0dd7
add funding
wee-snufkin Nov 13, 2023
98e3c0e
Merge branch 'galaxyproject:main' into filter-plot-explore-EBI-input
wee-snufkin Nov 13, 2023
bb6d9fc
add funding correctly
wee-snufkin Nov 13, 2023
ef44a1a
Pavan's suggestion
wee-snufkin Nov 13, 2023
7e5ef83
Edit overview section
nomadscientist Nov 14, 2023
59df18a
final fix
nomadscientist Nov 14, 2023
06859e8
Added explanation/introduction at beginning
nomadscientist Nov 14, 2023
9d71842
update intro statement
nomadscientist Nov 14, 2023
0132400
fix link
nomadscientist Nov 14, 2023
1fc24c8
add data exploration into beginning
nomadscientist Nov 14, 2023
247566a
Fix introductory pieces
nomadscientist Nov 14, 2023
2450276
add data exploration into beginning
nomadscientist Nov 14, 2023
1c1bf2c
update explore dataset section
nomadscientist Nov 14, 2023
6256768
Merge branch 'filter-plot-explore-EBI-input' of https://github.com/we…
nomadscientist Nov 14, 2023
8f33254
Update labelling and metadata
nomadscientist Nov 14, 2023
34f0e92
moved renaming to make it easier to go for seurat objects in future
nomadscientist Nov 14, 2023
3e840cf
add seurat path
nomadscientist Nov 14, 2023
2a57491
minor format changes
nomadscientist Nov 14, 2023
194 changes: 194 additions & 0 deletions topics/single-cell/tutorials/EBI-retrieval/tutorial.md
@@ -0,0 +1,194 @@
---
layout: tutorial_hands_on
subtopic: datamanipulation
priority: 3
title: EBI Single Cell Expression Atlas files to AnnData | Creating preprocessed dataset for sc-RNA Filter, Plot, Explore tutorial
questions:
- How do I use the EBI Single Cell Expression Atlas?
- How can I reformat and manipulate the downloads to create the right input for downstream analysis?
objectives:
- You will retrieve raw data from the EBI Single Cell Expression Atlas.
- You will manipulate the metadata and matrix files.
- You will combine the metadata and matrix files into an AnnData object for downstream analysis.

time_estimation: "15m"
key_points:
- The EMBL-EBI Single Cell Expression Atlas contains high quality datasets.
- Metadata manipulation is key for generating the correctly formatted resource.
- To use Scanpy tools, you have to combine your matrix and metadata into an AnnData object.
contributions:
  authorship:
    - wee-snufkin
  funding:
    - elixir-fair-data

requirements:
  -
    type: "internal"
    topic_name: single-cell
    tutorials:
      - scrna-case_alevin
      - scrna-case_alevin-combine-datasets

follow_up_training:
  -
    type: "internal"
    topic_name: single-cell
    tutorials:
      - scrna-case_basic-pipeline

tags:
- transcriptomics
- data management
---

# Introduction

> <agenda-title></agenda-title>
>
> In this tutorial, we will cover:
>
> 1. TOC
> {:toc}
>
{: .agenda}

# Getting data from Single Cell Expression Atlas

If you are interested in analysing publicly available data, particularly from the [Single Cell Expression Atlas](https://www.ebi.ac.uk/gxa/sc/home), you may find the following tool {% cite Moreno2020.04.08.032698 %} useful: it combines all the preprocessing steps shown in [the previous tutorial]({% link topics/single-cell/tutorials/scrna-case_alevin/tutorial.md %}) into one! For this tutorial, the dataset can be seen [at the EBI](https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-6945/downloads) under the experiment ID `E-MTAB-6945`.

> <hands-on-title>Retrieving data from Single Cell Expression Atlas</hands-on-title>
>
> 1. {% tool [EBI SCXA Data Retrieval](toolshed.g2.bx.psu.edu/repos/ebi-gxa/retrieve_scxa/retrieve_scxa/v0.0.2+galaxy2) %} with the following parameters:
>    - *"SC-Atlas experiment accession"*: `E-MTAB-6945`
>    - *"Choose the type of matrix to download"*: `Raw filtered counts`
>
{: .hands_on}

It's important to note that this matrix has already been partly processed by the SCXA pipeline, which is quite similar to the pre-processing shown in this case study tutorial series. It also contains all the metadata provided by their pipeline as well as by the authors (for instance, additional cell or gene annotations). So don't worry if the plots generated using this input method are slightly different!

# Metadata manipulation

Before creating an AnnData object, we need to make a small modification to the experimental design table. The dataset contains information about 7 samples (N701–N707), but in the experimental design table (cell metadata) they are just numbered from 1 to 7. The plotting tool that we are going to use later will fail if the entries are integers rather than categoricals, so we will change "1" to "N01" and so on. Preview the experimental design dataset and find the column "Sample Characteristic[individual]" (that's where the batch information is stored; don't worry, we will rename the column header later!). Make a note of the number of that column, which is 12: we will need it to change the batch numbers into batch names.

> <hands-on-title> Change batch numbers into names </hands-on-title>
>
> 1. Change the datatype of `EBI SCXA Data Retrieval on E-MTAB-6945 exp_design.tsv` to `tabular`:
>
>    {% snippet faqs/galaxy/datasets_change_datatype.md datatype="tabular" %}
>
> 2. {% tool [Column Regex Find And Replace](toolshed.g2.bx.psu.edu/repos/galaxyp/regex_find_replace/regexColumn1/1.0.3) %} with the following parameters:
>    - *"Select cells from"*: `EBI SCXA Data Retrieval on E-MTAB-6945 exp_design.tsv`
>    - *"using column"*: `c12`
>    - In *"Check"*:
>        - {% icon param-repeat %} *"Insert Check"*
>            - *"Find Regex"*: `1`
>            - *"Replacement"*: `N01`
>        - {% icon param-repeat %} *"Insert Check"*
>            - *"Find Regex"*: `2`
>            - *"Replacement"*: `N02`
>        - {% icon param-repeat %} *"Insert Check"*
>            - *"Find Regex"*: `3`
>            - *"Replacement"*: `N03`
>        - {% icon param-repeat %} *"Insert Check"*
>            - *"Find Regex"*: `4`
>            - *"Replacement"*: `N04`
>        - {% icon param-repeat %} *"Insert Check"*
>            - *"Find Regex"*: `5`
>            - *"Replacement"*: `N05`
>        - {% icon param-repeat %} *"Insert Check"*
>            - *"Find Regex"*: `6`
>            - *"Replacement"*: `N06`
>        - {% icon param-repeat %} *"Insert Check"*
>            - *"Find Regex"*: `7`
>            - *"Replacement"*: `N07`
>
> 3. Rename {% icon galaxy-pencil %} output `Cell metadata`
>
{: .hands_on}
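
For readers who like to see the logic outside Galaxy, the replacement above can be sketched in plain Python. This is only an illustration, assuming the checks are applied one after another to a single column; the two-column table and the helper name `replace_in_column` are hypothetical and are not part of the Galaxy tool.

```python
import csv
import io
import re

# Hypothetical excerpt of the experimental design table. The real table has
# more columns, and the batch number sits in column 12.
DESIGN_TSV = (
    "Assay\tSample Characteristic[individual]\n"
    "cell_a\t1\n"
    "cell_b\t7\n"
)

# (find regex, replacement) pairs mirroring the seven checks in the hands-on box.
CHECKS = [(str(n), f"N0{n}") for n in range(1, 8)]


def replace_in_column(tsv_text, column_index, checks):
    """Apply each regex check, in order, to one column of a TSV table."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    for row in rows:
        for pattern, replacement in checks:
            row[column_index] = re.sub(pattern, replacement, row[column_index])
    return rows


rows = replace_in_column(DESIGN_TSV, column_index=1, checks=CHECKS)
for row in rows:
    print("\t".join(row))
```

The header is left untouched here only because it contains no digits. With other datasets it is worth checking that the patterns cannot match anything you did not intend to change.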

Now we can create an AnnData object!

# Creating AnnData object

> <hands-on-title> Create AnnData object </hands-on-title>
>
> 1. {% tool [Scanpy Read10x](toolshed.g2.bx.psu.edu/repos/ebi-gxa/scanpy_read_10x/scanpy_read_10x/1.8.1+galaxy9) %}
> 2. Make sure you are using version **1.8.1+galaxy9** of the tool (change by clicking on {% icon tool-versions %} Versions button):
Member

@wee-snufkin what is broken in the latest version?

Should we hide this latest version?

Collaborator Author

That's the error message when I'm running this step with 1.8.1+galaxy93
(screenshot of the error message)

If it's not a quick fix, we might want to hide the latest version

Member

This error is fixable. We just need an older numpy version. e.g. numpy==1.21

Can you try this on EU please with the latest version? If this works I will fix the upstream package.

Collaborator

@pcm32 This is a matter of tool developer skills, if I'm not mistaken? Are you able to advise here?

Member

We don't seem to get this error on instances using singularity (although not sure at the exact wrapper version). Those newer versions only have changes in the galaxy wrapper, the dependency had remained the same for a long time I think

Collaborator

So what's going on with this? I'm reviewing this following @wee-snufkin instructions in the tutorial to use the older version. If this issue gets fixed, we can update it I guess

Collaborator

I've copied this discussion over to here so we don't lose it in the merged PR: galaxyproject/tools-iuc#5636

>    ![List of available tool versions shown when clicking on the 'Versions' button on the top of the page.](../../images/scrna-casestudy/version.png "How to change the version of the tool")
> 3. Set the following parameters:
>    - *"Expression matrix in sparse matrix format (.mtx)"*: `EBI SCXA Data Retrieval on E-MTAB-6945 matrix.mtx (Raw filtered counts)`
>    - *"Gene table"*: `EBI SCXA Data Retrieval on E-MTAB-6945 genes.tsv (Raw filtered counts)`
>    - *"Barcode/cell table"*: `EBI SCXA Data Retrieval on E-MTAB-6945 barcodes.tsv (Raw filtered counts)`
>    - *"Cell metadata table"*: `Cell metadata`
>
> 4. Rename {% icon galaxy-pencil %} output `AnnData object`
>
{: .hands_on}

# AnnData manipulation

Now we will make several modifications to the AnnData object so that you can follow the Filter, Plot, Explore tutorial even though the data was obtained in a different way.
First, we would like to flag mitochondrial genes. They can be identified quite easily since their names start with mt. Because the tool for flagging mitochondrial genes is case-sensitive, it is a good idea to check how the mitochondrial gene names are formatted in our dataset.

> <hands-on-title> Check the format of mitochondrial gene names </hands-on-title>
>
> 1. {% tool [Search in textfiles](toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_grep_tool/1.1.1) %} with the following parameters:
>    - *"Select lines from"*: `EBI SCXA Data Retrieval on E-MTAB-6945 genes.tsv (Raw filtered counts)`
>    - *"that"*: `Match`
>    - *"Regular Expression"*: `mt`
>    - *"Match type"*: `case insensitive`
>    - *"Output"*: `Highlighted HTML (for easier viewing)`
>
> 2. Rename {% icon galaxy-pencil %} output `Mito genes check`
>
{: .hands_on}
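
The search above is essentially a case-insensitive grep. A minimal Python sketch of the same idea follows; the gene table excerpt and its identifiers are made up for illustration, and the real `genes.tsv` is much longer.

```python
import re

# Hypothetical excerpt of a gene table: identifier, then gene symbol.
GENES_TSV = "ID0001\tmt-Nd1\nID0002\tmt-Co1\nID0003\tMtor\nID0004\tXkr4\n"

# Case-insensitive search for 'mt' anywhere in the line, as in the tool settings.
pattern = re.compile("mt", re.IGNORECASE)
matches = [line for line in GENES_TSV.splitlines() if pattern.search(line)]

# Among the matches, keep only the symbols that really start with 'mt-'.
symbols = [line.split("\t")[1] for line in matches]
mito_like = [symbol for symbol in symbols if symbol.startswith("mt-")]

print(symbols)
print(mito_like)
```

The broad search deliberately returns more than the mitochondrial genes (here the made-up `Mtor` entry), which is why it helps to look at the hits and confirm the exact prefix and capitalisation before flagging.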

If you click on that dataset, you will see all the genes containing 'mt' in their name. We can now clearly see that mitochondrial genes in our dataset start with 'mt-'. Keep that in mind, we will use it in a moment!

Speaking of gene names, we will also change the header of the column containing those names from `gene_symbols` to `Symbol`. This edit is only needed to make our AnnData object compatible with the Filter, Plot, Explore tutorial's workflow.

As mentioned earlier, we will also change the header of the column storing the batch information. In fact, we will change several other headers as well.

And the good news is that we can do all those steps using only one tool!

> <hands-on-title> Modify AnnData object </hands-on-title>
>
> 1. {% tool [AnnData Operations](toolshed.g2.bx.psu.edu/repos/ebi-gxa/anndata_ops/anndata_ops/1.8.1+galaxy92) %}
> 2. Make sure you are using version **1.8.1+galaxy92** of the tool (change by clicking on {% icon tool-versions %} Versions button)
> 3. Set the following parameters:
>    - In *"Input object in hdf5 AnnData format"*: `AnnData object`
>    - In *"Change field names in AnnData observations"*:
>        - {% icon param-repeat %} *"Insert Change field names in AnnData observations"*
>            - *"Original name"*: `Sample Characteristic[genotype]`
>            - *"New name"*: `genotype`
>        - {% icon param-repeat %} *"Insert Change field names in AnnData observations"*
>            - *"Original name"*: `Sample Characteristic[individual]`
>            - *"New name"*: `batch`
>        - {% icon param-repeat %} *"Insert Change field names in AnnData observations"*
>            - *"Original name"*: `Sample Characteristic[sex]`
>            - *"New name"*: `sex`
>        - {% icon param-repeat %} *"Insert Change field names in AnnData observations"*
>            - *"Original name"*: `Sample Characteristic[cell type]`
>            - *"New name"*: `cell_type`
>    - In *"Change field names in AnnData var"*:
>        - {% icon param-repeat %} *"Insert Change field names in AnnData var"*
>            - *"Original name"*: `gene_symbols`
>            - *"New name"*: `Symbol`
>    - *"Gene symbols field in AnnData"*: `Symbol`
>    - In *"Flag genes that start with these names"*:
>        - {% icon param-repeat %} *"Insert Flag genes that start with these names"*
>            - *"Starts with"*: `mt-`
>            - *"Var name"*: `mito`
>
> 4. Rename {% icon galaxy-pencil %} output `Mito-counted AnnData for downstream analysis`
>
{: .hands_on}
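
Conceptually, the step above does two things: it renames fields and it marks genes by a case-sensitive prefix. The plain-Python sketch below illustrates both ideas under simplifying assumptions. It does not use the AnnData library, and the helper names `rename_fields` and `flag_by_prefix` as well as the example values are hypothetical.

```python
# Field renames taken from the hands-on box above.
OBS_RENAMES = {
    "Sample Characteristic[genotype]": "genotype",
    "Sample Characteristic[individual]": "batch",
    "Sample Characteristic[sex]": "sex",
    "Sample Characteristic[cell type]": "cell_type",
}


def rename_fields(record, renames):
    """Return a copy of record with keys renamed; unknown keys are kept."""
    return {renames.get(key, key): value for key, value in record.items()}


def flag_by_prefix(symbols, prefix):
    """Case-sensitive prefix match, one boolean per gene symbol."""
    return [symbol.startswith(prefix) for symbol in symbols]


# Hypothetical metadata for one cell and a hypothetical list of gene symbols.
cell = {
    "Sample Characteristic[individual]": "N01",
    "Sample Characteristic[sex]": "male",
}
symbols = ["mt-Nd1", "MT-ND1", "Mtor"]

renamed = rename_fields(cell, OBS_RENAMES)
mito = flag_by_prefix(symbols, "mt-")
print(renamed)
print(mito)
```

The second symbol is not flagged because the match is case-sensitive, which is exactly why the capitalisation was checked beforehand.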

And that's all! What's even more exciting about the AnnData Operations tool is that it automatically calculates a number of metrics, such as `log1p_mean_counts`, `log1p_total_counts`, `mean_counts`, `n_cells`, `n_cells_by_counts`, `n_counts`, `pct_dropout_by_counts` and `total_counts`. Amazing, isn't it?
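
To give a feeling for what some of these per-gene metrics mean, here is a small sketch on a toy count matrix. It assumes the names follow the usual Scanpy QC definitions (for example, `pct_dropout_by_counts` as the percentage of cells with zero counts for a gene); the matrix and the helper `gene_metrics` are hypothetical, and the tool itself may compute these values differently.

```python
import math

# Hypothetical count matrix: rows are cells, columns are genes.
COUNTS = [
    [0, 3, 1],
    [2, 0, 1],
    [0, 0, 4],
    [0, 1, 2],
]


def gene_metrics(counts, gene_index):
    """Compute a few per-gene QC metrics for one column of the matrix."""
    column = [row[gene_index] for row in counts]
    n_cells = len(column)
    detected = sum(1 for value in column if value > 0)
    total = sum(column)
    mean = total / n_cells
    return {
        "n_cells_by_counts": detected,  # cells with at least one count
        "total_counts": total,
        "mean_counts": mean,
        "pct_dropout_by_counts": 100 * (1 - detected / n_cells),
        "log1p_total_counts": math.log1p(total),
        "log1p_mean_counts": math.log1p(mean),
    }


metrics = gene_metrics(COUNTS, gene_index=0)
for name, value in metrics.items():
    print(name, value)
```

For the first gene in this toy matrix, only one of four cells has any counts, so three quarters of the cells count as dropouts.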

# Conclusion
Now you can use this object as input for the [Filter, Plot, Explore tutorial]({% link topics/single-cell/tutorials/scrna-case_basic-pipeline/tutorial.md %}) and its associated workflow!

Even though this tutorial was designed specifically to make the AnnData object compatible with the subsequent tutorial, it also shows useful tools that you can use for your own, independent data analysis. Both the [workflow](https://singlecell.usegalaxy.eu/u/j.jakiela/w/ebi-single-cell-expression-atlas-files-to-anndata) and the [answer key history](https://singlecell.usegalaxy.eu/u/j.jakiela/h/ebi-single-cell-expression-atlas-files-to-anndata-1) are available. However, if you want to reuse the workflow from this tutorial, keep in mind that different datasets may have different column names, so check them first and only then modify them.
@@ -0,0 +1,11 @@
- doc: Test outline for EBI-Single-Cell-Expression-Atlas-files-to-AnnData
  job: {}
  outputs:
    Mito genes check:
      path: test-data/Mito genes check.html
    AnnData object:
      path: test-data/AnnData object.h5ad
    Mito-counted AnnData for downstream analysis:
      path: test-data/Mito-counted AnnData for downstream analysis.h5ad
    Cell metadata:
      path: test-data/Cell metadata.tabular