
"OverflowError: FASTA/FASTQ record does not fit into buffer" when trimming ONT reads #783

Open
diego-rt opened this issue May 1, 2024 · 7 comments

diego-rt commented May 1, 2024

Hi @marcelm

I'm using cutadapt 4.4 with Python 3.10.12 and I'm running into this error when trimming the ultra-long ULK114 adapters from a specific ONT PromethION flowcell. I'm wondering whether it is related to the file containing a few megabase-sized reads.

This is a description of the content of the file:

[diego.terrones@clip-login-1 6890b2ec397f656fd26681dc2d5e9b]$ seqkit stat -a reads.filtered.fq.gz 
file                  format  type  num_seqs        sum_len  min_len   avg_len    max_len      Q1      Q2      Q3  sum_gap     N50  Q20(%)  Q30(%)  GC(%)
reads.filtered.fq.gz  FASTQ   DNA    100,077  4,291,610,866    1,032  42,883.1  1,124,436  18,573  32,187  56,211        0  58,783   90.34   82.26   46.2

This is the command:

cutadapt --cores 4 -g GCTTGGGTGTTTAACCGTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA --times 5 --error-rate 0.3 --overlap 30 -m 1000 -o trimmed.fq.gz reads.filtered.fq.gz

This is the output:

This is cutadapt 4.4 with Python 3.10.12
Command line parameters: --cores 4 -g GCTTGGGTGTTTAACCGTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA --times 5 --error-rate 0.3 --overlap 30 -m 1000 -o trimmed.fq.gz reads.filtered.fq.gz
Processing single-end reads on 4 cores ...
ERROR: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/cutadapt/runners.py", line 87, in run
    for index, chunks in enumerate(self._read_chunks(*files)):
  File "/usr/local/lib/python3.10/dist-packages/cutadapt/runners.py", line 98, in _read_chunks
    for chunk in dnaio.read_chunks(files[0], self.buffer_size):
  File "/usr/local/lib/python3.10/dist-packages/dnaio/chunks.py", line 109, in read_chunks
    raise OverflowError("FASTA/FASTQ record does not fit into buffer")
OverflowError: FASTA/FASTQ record does not fit into buffer

(The same traceback is printed three more times, once per worker process.)

Traceback (most recent call last):
  File "/usr/local/bin/cutadapt", line 8, in <module>
    sys.exit(main_cli())
  File "/usr/local/lib/python3.10/dist-packages/cutadapt/cli.py", line 1061, in main_cli
    main(sys.argv[1:])
  File "/usr/local/lib/python3.10/dist-packages/cutadapt/cli.py", line 1131, in main
    stats = run_pipeline(
  File "/usr/local/lib/python3.10/dist-packages/cutadapt/runners.py", line 469, in run_pipeline
    statistics = runner.run()
  File "/usr/local/lib/python3.10/dist-packages/cutadapt/runners.py", line 350, in run
    chunk_index = self._try_receive(connection)
  File "/usr/local/lib/python3.10/dist-packages/cutadapt/runners.py", line 386, in _try_receive
    raise e
OverflowError: FASTA/FASTQ record does not fit into buffer

Many thanks!

marcelm commented May 1, 2024

Hi, that’s interesting. By default, the largest FASTQ record may occupy 4 million bytes. Since this includes the quality values, the maximum read length is about 2 Mbp. I thought this was enough ...

There is actually a hidden (and I believe undocumented) command-line option --buffer-size that you can use to increase the buffer size. Either find out the largest read length, multiply it by two and round up a bit, or try increasingly larger sizes. For example, --buffer-size=16000000 would allow reads of at most approx. 8 Mbp.
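
With the command from this issue, that could look like this (16000000 is just an example value; anything comfortably above twice the longest read length works):

cutadapt --cores 4 -g GCTTGGGTGTTTAACCGTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA --times 5 --error-rate 0.3 --overlap 30 -m 1000 --buffer-size=16000000 -o trimmed.fq.gz reads.filtered.fq.gz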

diego-rt commented May 1, 2024

Ah fantastic! I had found the corresponding line in your code and was about to edit it, but this is much more convenient.

I would say it is not rare to get reads of a few megabases with the ultra-long protocols, so it might be good to eventually increase the default for this buffer. I think a maximum read size of ~8 megabases should be pretty safe.

Thanks a lot!

diego-rt commented May 1, 2024

I can confirm that --buffer-size=16000000 does the job.

@diego-rt diego-rt closed this as completed May 1, 2024

marcelm commented May 1, 2024

Awesome! Let me re-open this until I’ve found a more permanent solution. Maybe I can make the buffer size grow dynamically or something along those lines.

@marcelm marcelm reopened this May 1, 2024
rhpvorderman (Collaborator) commented:

You could try the following pattern:

while True:
    try:
        for chunk in dnaio.read_chunks(files[0], self.buffer_size):
            pass
    except OverflowError:
        # A record did not fit: double the buffer and retry.
        self.buffer_size *= 2
        logging.warning("Keep some RAM sticks at the ready!")
        continue
    else:
        break  # or return to escape the loop

@marcelm
Copy link
Owner

marcelm commented May 1, 2024

The strategy is good, but just ignoring the exception and retrying will lose the contents of the buffer. This would have to be done within read_chunks directly.
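
For illustration, here is a rough sketch (not dnaio's actual code) of how a read_chunks-style generator could grow its buffer in place instead of raising. The name read_chunks_growing and the last-newline boundary check are simplified stand-ins; a real FASTQ reader has to locate proper record boundaries:

def read_chunks_growing(raw_file, initial_buffer_size=4 * 1024 * 1024):
    # Sketch only: grow the buffer instead of raising OverflowError
    # when a single record is larger than the current buffer.
    buffer = bytearray(initial_buffer_size)
    filled = 0  # number of valid bytes currently held in the buffer
    while True:
        read = raw_file.readinto(memoryview(buffer)[filled:])
        filled += read
        if read == 0:
            # End of file: hand out whatever is left and stop.
            if filled:
                yield bytes(buffer[:filled])
            return
        if filled < len(buffer):
            continue  # buffer not full yet, keep reading
        # Buffer is full: cut at the last (approximate) record boundary.
        end = buffer.rfind(b"\n", 0, filled) + 1
        if end == 0:
            # No complete record fits: double the buffer and keep reading
            # instead of raising OverflowError.
            buffer.extend(bytes(len(buffer)))
            continue
        yield bytes(buffer[:end])
        # Move the incomplete tail to the front and continue filling.
        remainder = filled - end
        buffer[:remainder] = buffer[end:filled]
        filled = remainder

Doubling keeps the number of regrows logarithmic in the size of the largest record, at the cost of temporarily holding that whole record in memory.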

rhpvorderman (Collaborator) commented:

Whoops, you are right. I incorrectly assumed blocks were passed rather than files.
