SRATOOLS_FASTERQDUMP failing due to missing fastq file #317

Open · edg1983 opened this issue Jul 24, 2024 · 11 comments
Labels: bug (Something isn't working)

edg1983 (Author) commented Jul 24, 2024

Description of the bug

I have a list of SRR IDs in a file and am trying to download them using the pipeline. However, I constantly get an error at the SRATOOLS_FASTERQDUMP step.

This is an example of a failing .command.sh:

#!/bin/bash -euo pipefail
export NCBI_SETTINGS="$PWD/user-settings.mkfg"

fasterq-dump \
    --split-files --include-technical \
    --threads 6 \
    --outfile SRX12493307_SRR16208975 \
     \
    SRR16208975

pigz \
     \
    --no-name \
    --processes 6 \
    *.fastq

cat <<-END_VERSIONS > versions.yml
"NFCORE_FETCHNGS:SRA:FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS:SRATOOLS_FASTERQDUMP":
    sratools: $(fasterq-dump --version 2>&1 | grep -Eo '[0-9.]+')
    pigz: $( pigz --version 2>&1 | sed 's/pigz //g' )
END_VERSIONS

From the log, it seems that no fastq files are found in the folder when pigz runs after fasterq-dump. Inspecting the working folder, I can see a file named SRX12493307_SRR16208975, which contains reads but has no .fastq extension, and thus the pigz command fails.

Command used and terminal output

Command used:

nextflow run nf-core/fetchngs --input SRP340133_accessions.csv --outdir ./SRP340133 -profile conda,ht_cpu -c /ssu/gassu/small_bits/fht_profile.config -c custom.conf

Here, custom.conf is used to set --max-size 50g, since some of the datasets are larger than 20 GB.

This is the error message from Nextflow.

ERROR ~ Error executing process > 'NFCORE_FETCHNGS:SRA:FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS:SRATOOLS_FASTERQDUMP (SRX12493307_SRR16208975)'

Caused by:
  Process `NFCORE_FETCHNGS:SRA:FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS:SRATOOLS_FASTERQDUMP (SRX12493307_SRR16208975)` terminated with an error exit status (1)


Command executed:

  export NCBI_SETTINGS="$PWD/user-settings.mkfg"

  fasterq-dump \
      --split-files --include-technical \
      --threads 6 \
      --outfile SRX12493307_SRR16208975 \
       \
      SRR16208975

  pigz \
       \
      --no-name \
      --processes 6 \
      *.fastq

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_FETCHNGS:SRA:FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS:SRATOOLS_FASTERQDUMP":
      sratools: $(fasterq-dump --version 2>&1 | grep -Eo '[0-9.]+')
      pigz: $( pigz --version 2>&1 | sed 's/pigz //g' )
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  spots read      : 117,423,804
  reads read      : 117,423,804
  reads written   : 117,423,804
  pigz: skipping: *.fastq does not exist


Relevant files

No response

System information

Nextflow: 24.04.2
Hardware: HPC
Executor: Slurm
Container engine: Conda
Pipeline version: 1.12
edg1983 added the bug label on Jul 24, 2024
Midnighter (Contributor) commented

Can you try to run that ID manually, please? Some files are simply broken.

edg1983 (Author) commented Jul 24, 2024

If I run fasterq-dump manually, it completes correctly, but the output file doesn't have a .fastq extension (the file name is simply SRX12493307_SRR16208975). This causes the subsequent pigz command to fail.

I see the same error for all of the roughly 40 SRATOOLS_FASTERQDUMP processes, so I assume the same thing is happening for all of them.

Midnighter (Contributor) commented

Can you show your custom config, please?

edg1983 (Author) commented Jul 24, 2024

Sure, here it is.

custom.conf

process {
    withName: SRATOOLS_PREFETCH {
        ext.args = '--max-size 50g'
    }
}

Midnighter (Contributor) commented

You overwrote the default args; can you add them back as well?

--split-files --include-technical

edg1983 (Author) commented Jul 25, 2024

Hi.

Using the custom config, I'm adding the --max-size option to prefetch and not to fasterq-dump, right?

If I understand this correctly, I'm not altering the options used by the SRATOOLS_FASTERQDUMP process. Indeed, I can see in the .command.sh that the required options (--split-files --include-technical) are set for fasterq-dump (see the code I pasted above).

Am I missing something?

Could it be that the issue is caused by SRR records that are registered as PAIRED in the SRA archive but apparently contain only one read? See this example: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR16208945&display=metadata

Midnighter (Contributor) commented

That's my bad; I didn't pay attention to the fact that you were customizing prefetch rather than dump, of course.

I can try looking at this ID myself later. In the meantime, can you remove IDs that cause errors and only run the rest?

Why did you choose sratools for the download rather than direct file downloads?

edg1983 (Author) commented Jul 25, 2024

I ran the pipeline with default settings, and my understanding is that the default download method is FTP, right?

The pipeline triggered the sratools workflow automatically for some of the entries, I guess when FASTQ files were unavailable from the FTP source?

What I can see is that there were 92 successful executions of SRA_FASTQ_FTP, and the sratools workflow was also run for some entries.

Midnighter (Contributor) commented

I just looked at one of the runs and the read count is reported as zero. So something seems off with that run. https://www.ebi.ac.uk/ena/browser/view/SRR16208975

Midnighter (Contributor) commented

Okay, I guess the problem has a simple cause. The runs that give you problems are marked as having a paired library layout. However, in reality they seem to contain single reads. Because the pipeline expects paired reads, the module passes the --outfile option without a file extension, since fasterq-dump will normally add the suffixes and extensions itself:

def outfile = meta.single_end ? "${prefix}.fastq" : prefix

However, since those runs are single reads, the extension is never added, which then causes pigz to fail. Perhaps a direct way to solve this is to create your own prefix for those samples only. Something like:

process {
    withName: SRATOOLS_FASTERQDUMP {
        ext.prefix = { "${meta.id}.fastq" }
    }
}
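
(With ext.prefix ending in .fastq, the output for these single-read runs is named ${meta.id}.fastq, so it matches the *.fastq glob that pigz receives.)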

edg1983 (Author) commented Aug 7, 2024

Hi,

Yes, this was also my hypothesis. Unfortunately, this is not the first time I have seen this kind of inconsistency in single-cell datasets in SRA. I'm not sure whether there is a way for the pipeline to catch this inconsistency from the SRA record metadata (library layout == PAIRED, but number of reads in the record == 1) and then generate a more informative warning message.
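
For what it's worth, such a mismatch can be spotted manually from ENA's run metadata. A minimal sketch, assuming ENA's portal API filereport endpoint with its library_layout and fastq_ftp report fields (paired runs list two FASTQ files separated by ';'); this is not part of the pipeline:

# Flag runs whose metadata says PAIRED but whose FASTQ listing
# contains only a single file.
acc=SRR16208975
curl -s "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=${acc}&result=read_run&fields=run_accession,library_layout,fastq_ftp&format=tsv" \
    | awk -F'\t' 'NR > 1 && $2 == "PAIRED" && $3 !~ /;/ {
          print $1 ": layout is PAIRED but only one FASTQ file is listed"
      }'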

Thanks for your suggestion. I will give it a try.

Meanwhile, I worked around it by editing the module script and adding a couple of lines of bash that check for files with a non-zero size and the expected name but no .fastq extension, and then append the .fastq extension to those.
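
The idea was roughly the following (a sketch only; the prefix value is just the example from this run, and the exact lines in the edited module differ):

# If fasterq-dump left a single, non-empty output file named exactly like
# the prefix (i.e. without an extension), rename it so that the downstream
# `pigz *.fastq` glob can pick it up.
prefix=SRX12493307_SRR16208975
if [ -s "${prefix}" ] && [ ! -e "${prefix}.fastq" ]; then
    mv "${prefix}" "${prefix}.fastq"
fi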

Thanks again for investigating this!
