Add URLs for data to import #5074

Merged
merged 4 commits into from
Jun 25, 2024
@@ -73,7 +73,7 @@ Because of the long processing time for the large original files, we have downsa

# Quality control

As for any NGS data analysis, ChIP-seq data must be quality controlled before being aligned to a reference genome. For more detailed information on NGS quality control, check out the tutorial [here]({{site.baseurl}}/topics/sequence-analysis).
As for any NGS data analysis, ChIP-seq data must be quality controlled before being aligned to a reference genome. For more detailed information on NGS quality control, check [out the tutorials]({% link topics/sequence-analysis/index.md %}).

> <hands-on-title>Performing quality control</hands-on-title>
>
@@ -83,6 +83,17 @@ As for any NGS data analysis, ChIP-seq data must be quality controlled before be
>
> 2. Import the ChIP-seq raw data (\*.fastqsanger) from [Zenodo](https://doi.org/10.5281/zenodo.197100).
>
> ```
> https://zenodo.org/record/197100/files/G1E_input_R1_downsampled_SRR507859.fastqsanger
> https://zenodo.org/record/197100/files/G1E_input_R2_downsampled_SRR507860.fastqsanger
> https://zenodo.org/record/197100/files/G1E_Tal1_R1_downsampled_SRR492444.fastqsanger
> https://zenodo.org/record/197100/files/G1E_Tal1_R2_downsampled_SRR492445.fastqsanger
> https://zenodo.org/record/197100/files/Megakaryocyte_input_R1_downsampled_SRR492453.fastqsanger
> https://zenodo.org/record/197100/files/Megakaryocyte_input_R2_downsampled_SRR492454.fastqsanger
> https://zenodo.org/record/197100/files/Megakaryocyte_Tal1_R1_downsampled_SRR549006.fastqsanger
> https://zenodo.org/record/197100/files/Megakaryocytes_Tal1_R2_downsampled_SRR549007.fastqsanger
> ```
>
> {% snippet faqs/galaxy/datasets_import_via_link.md %}
>
> 3. Examine the data in a FASTQ file by clicking on the {% icon galaxy-eye %} (eye) icon.
@@ -109,8 +120,8 @@ As for any NGS data analysis, ChIP-seq data must be quality controlled before be
> > 2. Why is the quality score decreasing across the length of the reads?
> >
> > > <solution-title></solution-title>
> > > 1. The Phred score. This score gives the probability of an incorrect base call, *e.g.* a score of 20 means that there is a 1% probability that the base is incorrect. See [here](https://en.wikipedia.org/wiki/Phred_quality_score) for more information.
> > > 2. This is an unsolved technical issue of the sequencing machines. The longer the sequences are, the more likely errors become. See [here](https://www.ecseq.com/support/ngs/why-does-the-sequence-quality-decrease-over-the-read-in-illumina) for more information.
> > > 1. The Phred score. This score gives the probability of an incorrect base call, *e.g.* a score of 20 means that there is a 1% probability that the base is incorrect. See [the Wikipedia page on Phred](https://en.wikipedia.org/wiki/Phred_quality_score) for more information.
> > > 2. This is an unsolved technical issue of the sequencing machines. The longer the sequences are, the more likely errors become. See [this article](https://www.ecseq.com/support/ngs/why-does-the-sequence-quality-decrease-over-the-read-in-illumina) for more information.
> > {: .solution }
> {: .question}
{: .hands_on}
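
The Phred score mentioned in the solution above follows the standard relation P = 10^(-Q/10), where Q is the score and P the probability that the base call is wrong. The sketch below is only an illustration and not part of the tutorial: the function name `phred_to_prob` is hypothetical, and it assumes a POSIX `awk` is available.

```shell
# Convert a Phred quality score Q into an error probability P = 10^(-Q/10).
# phred_to_prob is a hypothetical helper name, not a Galaxy or FastQC command.
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10^(-q/10) }'
}

phred_to_prob 20   # prints 0.0100 (1% chance the base is wrong)
phred_to_prob 30   # prints 0.0010 (0.1% chance the base is wrong)
```

A score of 30 therefore corresponds to a ten times lower error probability than a score of 20.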
@@ -152,7 +163,7 @@ It is often necessary to trim a sequenced read to remove bases sequenced with hi
# Aligning reads to a reference genome

To determine where DNA fragments originated from in the genome, the sequenced reads must be aligned to a reference genome. This is equivalent to solving a jigsaw puzzle, but unfortunately, not all pieces are unique. In principle, you could do a BLAST analysis to figure out where the sequenced pieces fit best in the known genome. Aligning millions of short sequences this way, however, can take a couple of weeks.
Nowadays, there are many read alignment programs for sequenced DNA, BWA being one of them. You can read more about the BWA algorithm and tool [here](https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btp324).
Nowadays, there are many read alignment programs for sequenced DNA, BWA being one of them. You can read more about the BWA algorithm and tool in {% cite Li_2009 %}.

> <hands-on-title>Aligning reads to a reference genome</hands-on-title>
>
@@ -205,7 +216,7 @@ Nowadays, there are many read alignment programs for sequenced DNA, BWA being on

To assess the similarity between the replicate sequencing datasets, a common technique is to calculate the correlation of read counts across the samples.

We expect that the replicate samples will cluster more closely to each other than to other samples. We will use tools from the deepTools package for the next few steps. More information on deepTools can be found [here](https://deeptools.readthedocs.io/en/latest/content/list_of_tools.html).
We expect that the replicate samples will cluster more closely to each other than to other samples. We will use tools from the deepTools package for the next few steps. More information on deepTools can be found [in deepTools' documentation](https://deeptools.readthedocs.io/en/latest/content/list_of_tools.html).

> <hands-on-title>Assessing correlation between samples</hands-on-title>
>
@@ -242,7 +253,7 @@ We expect that the replicate samples will cluster more closely to each other tha
> ![heatmap](../../images/tal1/plotCorrelation_heatmap_pearson_1kb.png "Heatmap of correlation matrix generated by plotCorrelation.")
{: .hands_on}

Additional information on how to interpret plotCorrelation plots can be found [here](https://deeptools.readthedocs.io/en/latest/content/tools/plotCorrelation.html#background).
Additional information on how to interpret plotCorrelation plots can be found [in deepTools' documentation](https://deeptools.readthedocs.io/en/latest/content/tools/plotCorrelation.html#background).

# Assessing IP strength

@@ -278,7 +289,7 @@ We will now evaluate the quality of the immunoprecipitation step in the ChIP-seq
> {: .question}
{: .hands_on}

Additional information on how to interpret plotFingerprint plots can be found [here](https://deeptools.readthedocs.io/en/latest/content/tools/plotFingerprint.html#background).
Additional information on how to interpret plotFingerprint plots can be found [in deepTools' documentation](https://deeptools.readthedocs.io/en/latest/content/tools/plotFingerprint.html#background).

# Determining TAL1 binding sites

@@ -373,7 +384,7 @@ We show here an alternative to Trackster, [IGV](http://software.broadinstitute.o
>
> 1. Open IGV on your local computer.
> 2. For each narrow peaks result file from the MACS2 computations, click on "display with IGV" --> "local Mouse mm10"
> 3. For more information about IGV see [here]({{site.baseurl}}/topics/introduction/tutorials/igv-introduction/tutorial.html)
> 3. For more information about IGV see [the IGV Tutorial]({% link topics/introduction/tutorials/igv-introduction/tutorial.md %})
{: .hands_on}

# Identifying unique and common TAL1 peaks between stages
@@ -495,7 +506,7 @@ We will now check whether the samples have more reads from regions of the genome
> >
> > > <solution-title></solution-title>
> > > 1. In an input ChIP-seq file, the expectation is that DNA fragments are uniformly sampled from the genome. This is in contrast to an IP ChIP-seq file where it is expected that certain genomic regions contain more reads (*i.e.* regions that are bound by the protein that is immunopurified). Therefore, non-uniformity of reads in the input sample could be a result of GC-bias, whereby more GC-rich fragments are preferentially amplified during PCR.
> > > 2. To answer this question, run the computeGCbias tool as described above and check out the results. What do YOU think? For more examples and information on how to interpret the results, check out the tool usage documentation [here](https://deeptools.readthedocs.io/en/latest/content/tools/computeGCBias.html#background).
> > > 2. To answer this question, run the computeGCbias tool as described above and check out the results. What do YOU think? For more examples and information on how to interpret the results, check out the tool usage documentation [in deepTools' documentation](https://deeptools.readthedocs.io/en/latest/content/tools/computeGCBias.html#background).
> > {: .solution }
> {: .question}
>
@@ -515,7 +526,7 @@ We will now check whether the samples have more reads from regions of the genome
> {: .question}
{: .hands_on}

Additional information on how to interpret computeGCbias plots can be found [here](https://deeptools.readthedocs.io/en/latest/content/tools/computeGCBias.html).
Additional information on how to interpret computeGCbias plots can be found [in deepTools' documentation](https://deeptools.readthedocs.io/en/latest/content/tools/computeGCBias.html).


# Conclusion
34 changes: 17 additions & 17 deletions topics/fair/tutorials/fair-ena/tutorial.md
@@ -232,24 +232,24 @@ First, we need to confirm that your read files are in the correct format. Refer

> <hands-on-title>Linux or OSX</hands-on-title>
> #### On a Linux-based operating system
> **Step 1**:
> Compress the fastq files for the upload using gzip.
> 1. Compress the fastq files for the upload using gzip.
>
> Open the terminal on your machine, then type the commands below. First move to the directory where the fastq files are located, then compress them using the gzip command.
> ```
> # In the command below replace '/path/to/fastq/directory' with the correct path
> cd /path/to/fastq/directory
>
> gzip *.fastq
> ```
> **Step 2**:
> To enable verification of the integrity of the uploaded fastq files, ENA requires an md5 checksum for each file.
>
> Type the command below to calculate the md5 sums and print them to a tab-separated file (for easy cut-and-paste later).
> ```
> for f in *.gz; do md5 $f | awk '{ gsub(/\(|\)/,""); print $2"\t" $4 }'; done > md5sums.tsv
> ```
> md5sums.tsv will contain a tab-separated table of fastq.gz filenames and their md5sum.
> Open the terminal on your machine, then type the commands below. First move to the directory where the fastq files are located, then compress them using the gzip command.
>
> ```
> # In the command below replace '/path/to/fastq/directory' with the correct path
> cd /path/to/fastq/directory
>
> gzip *.fastq
> ```
>
> 2. To enable verification of the integrity of the uploaded fastq files, ENA requires an md5 checksum for each file.
>
> Type the command below to calculate the md5 sums and print them to a tab-separated file (for easy cut-and-paste later).
> ```
> for f in *.gz; do md5 $f | awk '{ gsub(/\(|\)/,""); print $2"\t" $4 }'; done > md5sums.tsv
> ```
> md5sums.tsv will contain a tab-separated table of fastq.gz filenames and their md5sum.
>
{: .hands_on}
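
The `md5` command used above is the BSD/macOS tool; many Linux distributions ship `md5sum` from GNU coreutils instead. The following is a minimal sketch of a variant that works with either, under the assumption that one of the two is installed. The helper names `md5_of` and `md5_table` are hypothetical and not part of the tutorial or of any ENA tooling.

```shell
# Print only the md5 hash of one file, using whichever tool is installed.
md5_of() {
    if command -v md5sum >/dev/null 2>&1; then
        # GNU coreutils: output is "<hash>  <file>", keep the first field
        md5sum "$1" | awk '{ print $1 }'
    else
        # BSD/macOS: "md5 -q" prints the hash alone
        md5 -q "$1"
    fi
}

# Print one "<filename><TAB><md5>" line per file given as an argument.
md5_table() {
    for f in "$@"; do
        printf '%s\t%s\n' "$f" "$(md5_of "$f")"
    done
}

# Example, run inside the directory holding the compressed reads:
#   md5_table *.gz > md5sums.tsv
```

Quoting `"$f"` keeps the loop working for file names that contain spaces, which the unquoted form above does not handle.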
