Genomics data #17
Comments
I'll leave this open for future discussion, but I'm quite hesitant about this idea; see the "Designed for the 99%" section of the README. If someone is interested in putting together a genome compression benchmark using the squash-benchmark code I'd be happy to help with the squash side of things, including accepting patches to squash-benchmark-web to pull configuration in from a separate configuration file (so it's easier to publish results using custom data), but I don't think it would be appropriate to include it in this corpus. You might be interested in quixdb/squash-benchmark#35; I believe the data you link to is already covered, but if anything is missing I'd be happy to add it to the benchmark's Makefile to make it more easily testable.
Designed for the 99%. I like that! I can definitely see why genomics data would not be included in that corpus. However, as the algorithms get better and faster, you're going to have to select larger and harder targets to test against. A genomics data set would qualify as larger and harder, but then it wouldn't be in the 99%. I'll be fascinated to see how your squash corpus evolves over time to deal with this issue. Thanks again!
I disagree that it isn't in the "98%", and I think a genomics FASTQ data file would be an excellent test data set.
This might be more persuasive if you explained your position. Genomics data is obviously a huge user of compression, and an important use case for codec developers, but it's not really a useful data point for most people looking to choose a compression codec. IMHO that makes it a perfect fit for an additional genomics-specific corpus. FWIW, now that WebAssembly has stabilized a bit I plan to finish putting together this corpus soon. Unless someone presents a good argument for including genomics data, I don't plan to include it.
I actually found out about Squash through the page at http://jdlm.info/articles/2017/05/01/compression-pareto-docker-gnuplot.html and realized that the genomics dataset that is used for those tests would be an excellent addition to your corpus. They link to the page at http://hgdownload.cse.ucsc.edu/downloads.html if you want to download it directly.