Genomics data #17

Open
bknowles opened this issue May 30, 2017 · 4 comments

Comments

@bknowles

I actually found out about Squash through the page at http://jdlm.info/articles/2017/05/01/compression-pareto-docker-gnuplot.html and realized that the genomics dataset that is used for those tests would be an excellent addition to your corpus. They link to the page at http://hgdownload.cse.ucsc.edu/downloads.html if you want to download it directly.

@nemequ
Owner

nemequ commented May 30, 2017

I'll leave this open for future discussion, but I'm quite hesitant about this idea; see the "Designed for the 99%" section of the README.

If someone is interested in putting together a genome compression benchmark using the squash-benchmark code I'd be happy to help with the squash side of things, including accepting patches to squash-benchmark-web to pull configuration in from a separate configuration file (so it's easier to publish results using custom data), but I don't think it would be appropriate to include it in this corpus.

You might be interested in quixdb/squash-benchmark#35; I believe the data you link to is already covered, but if anything is missing I'd be happy to add it to the benchmark's Makefile to make it more easily testable.

@bknowles
Author

Designed for the 99%. I like that!

I can definitely see why the genomics data would not be included in that corpus. However, as the algorithms get better and faster, you're going to have to select larger and harder targets to test against. The genomics dataset would qualify as larger/harder, but then it wouldn't be in the 99%.

I'll be fascinated to see how your squash corpus evolves over time to deal with this issue.

Thanks again!

@abcbarryn

I disagree that that isn't in the "98%" and I think a genomics fastq data file would be an excellent test data set.

@nemequ
Owner

nemequ commented May 9, 2018

> I disagree that that isn't in the "98%" and I think a genomics fastq data file would be an excellent test data set.

This might be more persuasive if you explained your position.

Genomics data is obviously a huge user of compression, and an important use case for codec developers, but it's not really a useful data point for most people looking to choose a compression codec. IMHO that makes it a perfect fit for an additional genomics-specific corpus.

FWIW, now that WebAssembly has stabilized a bit I plan to finish putting together this corpus soon. Unless someone presents a good argument for including genomics data, I don't plan to include it.

3 participants