gzip? #1
Comments
Thanks for the feedback!
Curious, what's your workflow that uses gzipped files? I'm only aware of one popular sparse matrix library that can load gzipped files directly, SciPy, though FMM can easily be configured to do so (and the Python bindings already do). All the others accept only a plain, uncompressed file. It's probably worthwhile to add some compression benchmarks as well. I'll include GZip, but only to show how slow it is and what a bottleneck it becomes. Something like lz4 or zstd would fit this application better.
Good point, I'll add something like that when I get a bit of time. I'm curious how it would go myself.
Single-cell-whatever (single-cell RNA-seq, single-cell ATAC, and other omics assays) are, at some level, basically huge matrices of cell x gene. AFAIR they originally used compressed Matrix Market, but that's an undesirable option. These days they use their own formats to incorporate metadata associated with rows and cols, usually on top of hdf5 or a similar format. https://anndata.readthedocs.io/en/latest/ and h5seurat are (I think) the most widely adopted, but 1) they clearly target that application and 2) they additionally keep different analysis artifacts, which makes the format more complicated.
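To make the AnnData route concrete, roughly (a minimal sketch; the file name is made up):

```python
import anndata

# One .h5ad file bundles the matrix plus the row/col metadata.
adata = anndata.read_h5ad("cells.h5ad")  # hypothetical file name

adata.X    # cell x gene matrix, typically a scipy.sparse matrix
adata.obs  # per-cell (row) metadata as a pandas DataFrame
adata.var  # per-gene (column) metadata as a pandas DataFrame
```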
Compressed mtx still seems to be supported: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/matrices
Ok cool. So the Python side uses SciPy's loader:
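Roughly this (a minimal sketch; the path follows the 10x layout from the link above):

```python
import scipy.io

# scipy.io.mmread accepts gzip-compressed Matrix Market files directly,
# so the 10x matrix.mtx.gz can be passed as-is.
mat = scipy.io.mmread("filtered_feature_bc_matrix/matrix.mtx.gz")

# mmread returns a COO matrix; convert for typical downstream use.
mat = mat.tocsr()
```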
FYI the SciPy loader is as slow as anything on this list (for now). I wrote FMM partly to speed it up, and it does. You can expect a 20x speedup on your laptop: https://github.com/alugowski/fast_matrix_market/tree/main/python I do say "for now" because we're close to getting FMM merged into SciPy, so in the future that example code should magically get faster for free for you. Based on that example it looks like R's
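For concreteness, the bindings on that page are used roughly like this (a minimal sketch; see the linked README for the exact API):

```python
import fast_matrix_market as fmm

# Drop-in style equivalents of scipy.io.mmread / scipy.io.mmwrite.
mat = fmm.mmread("matrix.mtx")   # typically returns a SciPy sparse matrix
fmm.mmwrite("copy.mtx", mat)
```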
Cool. But just in case: I don't use mtx/mtx.gz and I think very few do these days. I really appreciate your endeavors, because scipy's mmread is indeed fantastically slow, but I think we'd better promote some binary format as an alternative. Personally I'd much prefer a trivial parquet-based / hdf5-based format to mtx/mtx.gz 😄
10GiB file (note machine has 16GiB RAM):
I guess it's ok. A bit faster reads, a bit slower writes. File size is smaller than MM, but larger than compressed MM. Honestly I'm disappointed in all the binary formats I've tried. They should be maxing out the I/O, not going head-to-head with a human-readable archival format. Again, this is on a 6-core laptop; FMM can use more cores.
What does your hdf5 workflow look like? What sorts of performance do you see?
Not sure I'd love using many cores for IO, but tastes differ :)
0.053 sec? Strange, likely just IO bound. You have a good SSD? I've run my benchmark with a dense-pretending-to-be-sparse 1GB file:

reading: CPU times: user 2.35 s, sys: 778 ms, total: 3.13 s

construction of scipy matrix: CPU times: user 167 ms, sys: 153 ms, total: 319 ms

That's a mac M1, but most systems with a reasonable SSD should be in the same range. Parquet has settings for better compression, just as gzip does.
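The benchmark is essentially the two phases below (my own sketch; the file path and the row/col/data column names are just my convention):

```python
import polars as pl
import scipy.sparse

# Phase 1, "reading": pull the COO triplet columns out of parquet.
df = pl.read_parquet("/tmp/sparse_coo.parquet")  # hypothetical path

# Phase 2, "construction of scipy matrix": build the sparse matrix.
# The .to_numpy() calls copy the data out of the Arrow buffers.
mat = scipy.sparse.coo_matrix(
    (df["data"].to_numpy(), (df["row"].to_numpy(), df["col"].to_numpy()))
).tocsr()
```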
For hdf5, the pipeline is the same - you use the analogous HDF5 tooling instead of a parquet reader. Or use https://anndata.readthedocs.io/en/latest/api.html with HDF5.

HDF5 vs parquet: I think these days parquet is more widespread than hdf5, and tooling like fastparquet is usually better. Also, parquet got traction from big tech, and it is more like a standard, whereas hdf5 is more like a 'standard implementation'.
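The hdf5 variant of the same idea, roughly (an untested sketch; dataset names are arbitrary):

```python
import h5py
import scipy.sparse

def save_coo_h5(path, mat):
    coo = mat.tocoo()
    with h5py.File(path, "w") as f:
        # Same triplet layout as the parquet version, just as HDF5 datasets.
        f.create_dataset("row", data=coo.row)
        f.create_dataset("col", data=coo.col)
        f.create_dataset("data", data=coo.data)
        f.attrs["shape"] = coo.shape

def load_coo_h5(path):
    with h5py.File(path, "r") as f:
        return scipy.sparse.coo_matrix(
            (f["data"][:], (f["row"][:], f["col"][:])),
            shape=tuple(f.attrs["shape"]),
        )
```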
Of course inefficient I/O isn't good, and matrix market certainly isn't optimal for speed. But if those cores are idle and the operation is compute bound, which ALL of these appear to be, Parquet included, then why not use the idle cores to finish faster?
27.7 s wall-clock time. Notice it's a 10GB file on a machine with 16GB RAM. The point of that test is to see how well the I/O library handles matrices that take up more than half of RAM. Often they like to duplicate the matrix, either explicitly somehow or implicitly via a cache. This is one example of bad performance: the 10GB file is 10x larger than the 1GB one but takes 100x as long.
So 10x slower than what my benchmark showed? I saw 0.272s for a sparse 1GB file with Parquet. Just export your file to
I also have an M1 mac.
I pointed to the unusually low CPU usage, and now I see the reason.
The benchmark makes no practical sense. If you don't have twice the memory, you can't operate on it in Python (as practically anything creates a copy). You could appreciate that a trivial and efficient format was assembled in a couple of lines, and I put zero effort into optimizing it. And yes, there is a conversion from polars to numpy that creates a copy.
I don't have that much space on the drive - mtx is 6 times larger at smaller sizes. For the sake of science, side by side, but at a smaller size:

saving/loading/converting to scipy: Wall time: 992 ms; CPU times: user 686 ms, sys: 401 ms, total: 1.09 s

saving/loading/converting to scipy with snappy compression, level=1: Wall time: 564 ms; CPU times: user 322 ms, sys: 326 ms, total: 647 ms

fmm (in python, reading to scipy, includes conversion): CPU times: user 7.62 s, sys: 767 ms, total: 8.39 s

File sizes: 2.3G Jul 16 18:56 /tmp/sparse_coo.mtx

I don't doubt you have the fastest mtx reader in the west. There are a ton of issues around using sparse in python, if you're curious for some practical problems:
I have to disagree that large matrix benchmarks make no sense. Maybe your use case doesn't care, but there are many that do. Naturally 10GB is an extreme example, but that's on purpose. The slowdowns begin around 4GB. Here are reads on files going from 1GB to 8GB:
Notice the read speed goes from an effective 2GB/s down to 0.25GB/s. That's a big drop; a binary format should be able to do much better than that. Maybe you're ok with writing custom code for every file, but that should not be required, and it's certainly a deal breaker in a lot of applications. I'm not trying to sell Matrix Market as the future for everyone. It has its applications (maybe you're not in the target market for it, that's ok) and benefits that I won't list here because I know you don't care about them (and that's ok). I use it here because it's the only universal format, so it serves as a useful comparison target.
There are many ways to do compression in C++; if you have favorites that would work well with this repo, let me know.
Hi, this is a random person who got this repo in github recommendations.
Comment1: cool and nice, but ungzipped Matrix Market is a very expensive option. A comparison with gzip is more valuable IMO.
Comment2: And as a baseline, it's valuable to see something terribly simple based on a columnar format, e.g. the sketch below, and see how these compare in terms of speed (because from the perspective of floating-point precision, binary formats certainly win).
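Something in this spirit (a made-up minimal baseline, not a real format spec; pyarrow and the column names are just what came to mind):

```python
import pyarrow as pa
import pyarrow.parquet as pq
import scipy.sparse

# "Format": the COO triplets stored as three parquet columns.
def write_coo_parquet(path, mat):
    coo = mat.tocoo()
    pq.write_table(pa.table({"row": coo.row, "col": coo.col, "data": coo.data}), path)

def read_coo_parquet(path, shape=None):
    t = pq.read_table(path)
    return scipy.sparse.coo_matrix(
        (t["data"].to_numpy(), (t["row"].to_numpy(), t["col"].to_numpy())),
        shape=shape,
    )
```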