Preliminary analysis fails #26
Comments
Hey Nastassia,
Nice job getting to this point, that's awesome! The requirements for the
html report are fairly specific to the MURI Mod 3 naming conventions and
sample sheet, you're right. I think your two options from this point are:
1) adjust your sample sheet and sample names to match the format used by
Mod 3 - if you want to go this route and are having trouble matching these
formats, we can have a chat about how to do that. As long as you're using
the fields in the sample naming conventions
<https://github.com/MMARINeDNA/metabarcoding_QAQC_pipeline/wiki/Preparing-Your-Data#sample-naming-conventions>
section of the wiki, with no extra or missing fields, you should be good.
It looks like you're using "MFU-FISH" as the primer name, which won't match
our metadata sheets that call MiFish "MFU", so you'll likely need to adjust
that to "MFU"
2) download the .Rmd file that makes the html report and adjust the code
there to fit the format of your sample names and sample sheet. That
might require a bit more up front work on your part right now, but down the
line would likely require less adjustment of sample names and sample
sheets.
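A quick way to see how a renamed file would be split into fields is sketched below. This is only an illustration: the regular expression assumes the file name is a primer field, a sample field, and the standard Illumina `_S#_L###_R#_001.fastq.gz` suffix, and `EXPECTED_PRIMERS` is a hypothetical set holding just the "MFU" value mentioned above. The wiki's naming conventions remain the authority on the real format.

```python
import re

# Assumed layout: <primer>_<sample>_S<n>_L<lane>_R<1|2>_001.fastq.gz
# Only the trailing Illumina suffix is standard; the leading fields are an assumption.
FASTQ_RE = re.compile(
    r"^(?P<primer>[^_]+)_(?P<sample>.+)"
    r"_S(?P<index>\d+)_L(?P<lane>\d{3})_R(?P<read>[12])_001\.fastq\.gz$"
)

# Hypothetical: the primer names the metadata sheet is expected to use.
EXPECTED_PRIMERS = {"MFU"}


def check_fastq_name(name):
    """Return (primer_field, message) for one fastq file name."""
    match = FASTQ_RE.match(name)
    if match is None:
        return None, "name does not match the assumed pattern"
    primer = match.group("primer")
    if primer not in EXPECTED_PRIMERS:
        return primer, f"primer field '{primer}' not in {sorted(EXPECTED_PRIMERS)}"
    return primer, "ok"


if __name__ == "__main__":
    for name in (
        "MFU-FISH_001-d1-1_S1_L001_R1_001.fastq.gz",
        "MFU_001-d1-1_S1_L001_R1_001.fastq.gz",
    ):
        print(name, "->", check_fastq_name(name))
```

Under these assumptions the first name is flagged because its primer field is "MFU-FISH", while the second passes.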
Cheers,
Amy
<*)))>< <*)))>< <*)))>< <*)))>< <*)))>< <*)))>< <*)))>< <*)))><
Amy M. Van Cise, Ph.D.
(she/her/hers)
Assistant Professor
Whale and Dolphin Ecology Lab <http://amyvancise.com>
University of Washington | School of Aquatic and Fisheries Sciences
1122 NE Boat St, Box 355020
Seattle, WA 98105
Office: SAFS 216B
206-221-6118
Need to meet with me? Let's find a time
<https://calendar.app.google/6S7FAok44L6n2TpF7>.
Where is Amy? [Summer 2023 edition]**
Monday: UW campus
Tuesday: UW campus
Wednesday: UW campus
Thursday: NOAA NWFSC Genetics lab
Friday: UW campus
**This is not exact. If you can't find me, shoot me an email and I will get
back to you.
"My paper was one long gigantic blunder from beginning to end."
-Charles Darwin
On Wed, Oct 11, 2023 at 6:08 PM Nastassia Patin wrote:
Hi team! I think I'm at a point where I can start posting issues here
rather than individually emailing people, particularly because I think some
problems will be widespread once others try to use the pipeline.
Although my "final_data" folder contains all the output files described in
the wiki, the "analysis_output" folder is always empty. The wiki says for
the preliminary analysis: "This is a quarto file that will take in the
output files from DADA2 and create plots and statistics regarding read
retention, read lengths, quality, and more." I think this is supposed to be
an HTML output? In any case it would be great to get that final step
working.
I think two issues might be preventing the preliminary analysis: 1) read
file names that are different from the required format and 2) a metadata
sheet that is different from the required format.
1. Although I always try to rename the fastq files according to the
formula, it's possible something is off. Here is an example of a read pair
that I've renamed: MFU-FISH_001-d1-1_S1_L001_R1_001.fastq.gz and
MFU-FISH_001-d1-1_S1_L001_R2_001.fastq.gz. Simplifying the requirements for
raw read file names would be a huge help; I'm always nervous about renaming
any raw data.
2. The sample sheets we get from our sequencing center are quite
different from the example sheet for the pipeline. I've uploaded an example
raw file ("SampleSheet.csv") as well as a modified sheet that I made
manually to try to match the example sheet ("SampleSheetUsed.csv"). In the
long run, this is a big pain in the butt! Maybe we can simplify the
metadata sheet requirements? I suspect something about the sample sheet is
preventing the preliminary analysis but I'm not sure. In the last step of
the pipeline I get an error that says "Run name not found."
SampleSheet.csv
<https://github.com/MMARINeDNA/metabarcoding_QAQC_pipeline/files/12875927/SampleSheet.csv>
SampleSheetUsed.csv
<https://github.com/MMARINeDNA/metabarcoding_QAQC_pipeline/files/12875930/SampleSheetUsed.csv>
Thanks Amy! I can definitely work on Option 2 to edit the .Rmd file. For what it's worth, I tried changing the file names again to match the formula 100%, but that didn't help, so it must be about the metadata sheet. I also think there might be a missing R module in the Docker image; see below for the full error message from my most recent analysis.

In the long run, if this is a tool we want to disseminate to other labs or scientists, I think it will be important to incorporate more flexibility in the file names and formats. I may be able to help with some of that with my .Rmd edits. Will keep everyone posted.

pipeline error message:
[1] "Starting Taxonomy Assignment at 2023-10-12 21:21:23.264674"
Thanks Nastassia! Keep us posted on how it goes.
I don't see an error message in what you copied, just a couple warnings
that seem to be referring to missing values in your window_values vector,
which is generated early in the dada2 pipeline. Warning messages generally
do not cause code to stop running, so there may be something else happening.
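For anyone puzzling over the "no non-missing arguments to min; returning Inf" warnings: R emits that when `which(...)` selects nothing, so `min()` is called on an empty vector. The sketch below is a Python analogue of that expression, not the pipeline's code. The function name and the example values are made up for illustration, and `None` stands in for R's `NA`.

```python
import math


def first_window_below(window_values, threshold):
    """Python analogue of R's min(which(window_values < threshold)).

    Returns the 1-based position of the first value below the threshold.
    Missing values (None here, NA in R) never count as a hit. When nothing
    qualifies, returns math.inf, mirroring R's Inf-with-a-warning behaviour.
    """
    hits = [
        position
        for position, value in enumerate(window_values, start=1)
        if value is not None and value < threshold
    ]
    return min(hits) if hits else math.inf


if __name__ == "__main__":
    # Illustrative numbers only, not taken from any real run.
    print(first_window_below([36, 35, 28, 20], 30))  # a window drops below 30
    print(first_window_below([36, 35, 34], 30))      # nothing below 30
    print(first_window_below([None, None], 30))      # all values missing
```

So the warning can come either from a vector of missing values or from quality values that never fall below the threshold; the log alone does not say which applies here.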
Just to clarify on future plans - as of now, we don't have plans to
formally disseminate this to other labs. That could change in the future,
but the decision would be up to Ryan. Rather, this pipeline has been
developed for use by MURI, but we make it publicly available so that folks
who would like to use it can (at their own risk, and their own
responsibility). The pipeline is fully based on previously published
metabarcoding QAQC pipelines, and I think it's in the best interest of
other labs to build their own pipelines using the existing resources, so
that they can be sure that they understand the various steps in their
pipelines and that they're optimized to fit the specific needs of that
lab/project.
On Mon, Oct 16, 2023 at 1:39 PM Nastassia Patin wrote:
pipeline error message:
[1] "Starting Taxonomy Assignment at 2023-10-12 21:21:23.264674"
Finished processing reference fasta.[1] "Finished Taxonomy Assignment at
2023-10-12 21:22:17.561918 ."
Warning messages:
1: In grSoftVersion() :
unable to load shared object '/usr/local/lib/R/modules//R_X11.so':
libXt.so.6: cannot open shared object file: No such file or directory
2: In min(which(window_values < primer.data$F_qual[i])) :
no non-missing arguments to min; returning Inf
3: In min(which(window_values < primer.data$F_qual[i])) :
no non-missing arguments to min; returning Inf
4: In min(which(window_values < primer.data$F_qual[i])) :
no non-missing arguments to min; returning Inf
5: In min(which(window_values < primer.data$F_qual[i])) :
no non-missing arguments to min; returning Inf
6: In min(which(window_values < primer.data$R_qual[i])) :
no non-missing arguments to min; returning Inf
7: Using all_of() outside of a selecting function was deprecated in
tidyselect 1.2.0.
ℹ See details at
https://tidyselect.r-lib.org/reference/faq-selection-context.html
finished step 1. 21:22:17
starting step 3: making the stats file... 21:22:20
finished step 3. 21:22:23
metabarcoding pipeline complete! 21:22:25
Right, sorry, it's not really an error message, but I was wondering if there might be a link between the warnings and the failure to run the final step. Most likely it's due to the metadata sheet though, so I'll try solving that first. Re: pipeline availability, got it, thanks for clarifying. I'll just see what I can do to get it working for our sequence data files.

Just to confirm, is it the "Report_MURI_Module3.qmd" file that generates the preliminary analyses? I can't find a .Rmd file in the file system.
Yup! qmd and Rmd files are interchangeable - qmd just refers to Quarto, the
newer, spiffier version of Rmarkdown. But both read both.