Use python-isal for compression/decompression #12092

Open
rhpvorderman opened this issue Jun 4, 2021 · 2 comments

Comments

@rhpvorderman
Contributor

rhpvorderman commented Jun 4, 2021

Galaxy supports fastq.gz files. For anyone interested in very fast gzip compression, I recommend checking out ISA-L, which comes with an igzip application that compresses and decompresses much faster than standard gzip.
"Much faster" in this case means 3x faster decompression and 6x faster compression. It is available on conda-forge and can be installed with conda install -c conda-forge isa-l.

The good news is that there are also Python bindings available. I wrote these myself, and an extensive test suite ensures that they work properly. The Python bindings are now used by xopen and, by extension, cutadapt.

Using python-isal will make decompression a lot faster. For compression there is a slight tradeoff: files will be slightly larger, because ISA-L does not support very high compression levels (though it still compresses better than gzip level 1).

EDIT: I am willing to implement this myself if there is interest. Also, I forgot to mention that python-isal has no dependencies (the C library is statically linked), so there is no dependency hell.

@mvdbeek
Member

mvdbeek commented Jun 4, 2021

If you can make these optional imports (it sounds like the interface is mostly compatible with gzip?), I think that would be a nice extension.
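For illustration, an optional import along these lines could look roughly like the sketch below. It assumes the `isal.igzip` module from python-isal mirrors the standard library `gzip.open` interface; `gzip_impl` and `open_gz` are hypothetical names for this example, not existing Galaxy code.

```python
import gzip as _stdlib_gzip

try:
    # Assumption: python-isal is installed and provides a gzip-compatible module.
    from isal import igzip as gzip_impl
except ImportError:
    # Fall back to the standard library when python-isal is not available.
    gzip_impl = _stdlib_gzip


def open_gz(path, mode="rb"):
    """Hypothetical helper: open a gzip file with whichever backend was imported."""
    return gzip_impl.open(path, mode)
```

Callers would use `open_gz` exactly like `gzip.open`, so the faster backend is picked up only when it is installed.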

@rhpvorderman
Contributor Author

I see isal is now a hard dependency due to your work on #17342

I see the current gz to uncompressed converter uses gzip -dcf. However, since python-isal is required, python -m isal.igzip should also be available.

To illustrate the difference I decompress a 1.6GB fastq file here:

Benchmark 1: python -m isal.igzip -cd ~/test/5millionreads_R1.fastq.gz > /dev/null
  Time (mean ± σ):      2.008 s ±  0.011 s    [User: 1.956 s, System: 0.051 s]
  Range (min … max):    1.997 s …  2.028 s    10 runs
Benchmark 1: gzip -cd ~/test/5millionreads_R1.fastq.gz > /dev/null
  Time (mean ± σ):      8.162 s ±  0.080 s    [User: 8.103 s, System: 0.058 s]
  Range (min … max):    8.093 s …  8.375 s    10 runs

4 times faster! By the way, this is mostly due to gzip's code, not to zlib. If I use the pigz implementation on one thread the decompression is also faster than gzip:

Benchmark 1: pigz -p 1 -cd ~/test/5millionreads_R1.fastq.gz > /dev/null
  Time (mean ± σ):      4.123 s ±  0.025 s    [User: 4.076 s, System: 0.047 s]
  Range (min … max):    4.089 s …  4.173 s    10 runs

Still, that makes the python -m isal.igzip command two times faster than any zlib alternative for decompression.
Is there a way this could be leveraged in https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/datatypes/converters/gz_to_uncompressed.xml?
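One possible shape for this, as a rough sketch only: pick the decompression command depending on whether the `isal` package can be imported. The `-cd` flags are the ones used in the benchmarks above; `decompress_command` is a hypothetical helper and not part of Galaxy. Note that the sketch does not reproduce the `-f` behaviour of the current `gzip -dcf` call, which would need checking separately.

```python
import importlib.util
import sys


def decompress_command(path):
    """Hypothetical helper: build an argv list that decompresses `path` to stdout."""
    if importlib.util.find_spec("isal") is not None:
        # python-isal is importable, so use its command-line module.
        return [sys.executable, "-m", "isal.igzip", "-cd", path]
    # Otherwise fall back to the system gzip binary.
    return ["gzip", "-dc", path]
```

The returned list could be passed to `subprocess.run(..., stdout=outfile)`; inside the converter XML the equivalent would simply be swapping the command string.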
