Output read lengths are affected by duplicate --adapter_sequence arguments #575

Open
mwhamgenomics opened this issue Sep 20, 2024 · 0 comments


I've been running fastp as part of a larger third-party pipeline (i.e. not written or maintained by me), and noticed that it was specifying adapter sequences multiple times on the command line:

    ...
    --adapter_sequence CTGTCTCTTATACACATCT \
    --adapter_sequence AGATGTGTATAAGAGACAG \
    --adapter_sequence AGATGTGTATAAGAGACAG \
    --adapter_sequence CTGTCTCTTATACACATCT \
    ...

I tried seeing what fastp would do without the duplicate arguments, expecting to get the same results:

    ...
    --adapter_sequence CTGTCTCTTATACACATCT \
    --adapter_sequence AGATGTGTATAAGAGACAG \
    ...

But I found that in some cases my read lengths were now different - sometimes only r1 was affected, sometimes only r2, sometimes both. The adapter sequences being specified don't even appear in the fastqs in this case, so I expected them to have no effect.

Steps to reproduce:

# GiaB test data
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R{1,2}_001.fastq.gz

# fastp 0.23.4
wget http://opengene.org/fastp/fastp.0.23.4
chmod u+x fastp.0.23.4
ln -s fastp.0.23.4 fastp

# proof that the adapter sequences are absent in the fastqs - so surely should have no effect?
for f in U0a_CGATGT_L001_R*; do echo "$f"; for a in CTGTCTCTTATACACATCT AGATGTGTATAAGAGACAG; do zcat "$f" | grep -c "$a"; done; done

# subset to a minimal example of 3 reads known to be affected
zcat U0a_CGATGT_L001_R1_001.fastq.gz | grep -E '^@HWI-D00360:5:H814YADXX:1:1101:(3756:2236|7206:2194|5147:4880)' -A 3 --no-group-separator | head -n 12 | gzip -c > minimal_r1.fastq.gz
zcat U0a_CGATGT_L001_R2_001.fastq.gz | grep -E '^@HWI-D00360:5:H814YADXX:1:1101:(3756:2236|7206:2194|5147:4880)' -A 3 --no-group-separator | head -n 12 | gzip -c > minimal_r2.fastq.gz

# run fastp with/without duplicated --adapter_sequence args
fastp -i minimal_r1.fastq.gz -I minimal_r2.fastq.gz -o r1_trimmed.fastq.gz -O r2_trimmed.fastq.gz \
    --thread 8 \
    --adapter_sequence CTGTCTCTTATACACATCT \
    --adapter_sequence AGATGTGTATAAGAGACAG \
    --adapter_sequence AGATGTGTATAAGAGACAG \
    --adapter_sequence CTGTCTCTTATACACATCT

fastp -i minimal_r1.fastq.gz -I minimal_r2.fastq.gz -o r1_trimmed_nodup.fastq.gz -O r2_trimmed_nodup.fastq.gz \
    --thread 8 \
    --adapter_sequence CTGTCTCTTATACACATCT \
    --adapter_sequence AGATGTGTATAAGAGACAG

The above example consists of three reads, each of which was affected in the same way in both the minimal fastqs above and the full-size ones:

  • @HWI-D00360:5:H814YADXX:1:1101:7206:2194 - r1 affected
  • @HWI-D00360:5:H814YADXX:1:1101:3756:2236 - r2 affected
  • @HWI-D00360:5:H814YADXX:1:1101:5147:4880 - both affected
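
One way to confirm which reads differ between the two runs is to compare per-read sequence lengths in the output fastqs. The following is only a rough sketch: `read_lengths` and `compare_lengths` are hypothetical helper names made up for this example (they are not part of fastp), and it assumes standard four-line FASTQ records and the output file names used in the commands above.

```shell
# Hypothetical helper: print "<read id><TAB><sequence length>" for each
# record in a gzipped FASTQ (assumes four lines per record).
read_lengths() {
    gzip -dc "$1" | awk 'NR % 4 == 1 { id = $1 }
                         NR % 4 == 2 { print id "\t" length($0) }'
}

# Hypothetical helper: print the reads present in both files whose lengths
# differ, as "<read id><TAB><length in file 1><TAB><length in file 2>".
compare_lengths() {
    tmp_a=$(mktemp) || return 1
    tmp_b=$(mktemp) || return 1
    read_lengths "$1" > "$tmp_a"
    read_lengths "$2" > "$tmp_b"
    awk -F '\t' 'NR == FNR { len[$1] = $2; next }
                 ($1 in len) && len[$1] != $2 { print $1 "\t" len[$1] "\t" $2 }' \
        "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}

# Compare the outputs of the two fastp runs above, if they exist.
for r in r1 r2; do
    if [ -f "${r}_trimmed.fastq.gz" ] && [ -f "${r}_trimmed_nodup.fastq.gz" ]; then
        echo "== ${r}: duplicated args vs. no duplicates =="
        compare_lengths "${r}_trimmed.fastq.gz" "${r}_trimmed_nodup.fastq.gz"
    fi
done
```

Reads that were filtered out entirely in one of the two runs are not reported by this sketch, since it only compares IDs present in both files.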

Do you know what could be causing this? Is it an expected use-case to specify the same adapter sequence multiple times?
