Add human genome data? #35
I will need to look into which files are most appropriate for inclusion, based on size and license.
I think it's a good idea! I have also downloaded genome data because it's an interesting input for compressors, and it seems a very practical target as there is a huge amount of this data in the wild.
Here are the links to human genome chromosome 1 and a typical next-generation sequencing dataset:

http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr1.fa.gz
ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
The chr1.fa.gz file seems appropriately licensed: "All the files in this directory are freely available for public use."
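For reference, fetching and unpacking those two files might look like this (an untested sketch; it assumes wget is available and a gzip new enough to support -k/--keep):

```sh
# Fetch chromosome 1 of the hg38 reference (gzip-compressed FASTA)
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr1.fa.gz

# Fetch a typical next-generation sequencing read set (gzip-compressed FASTQ)
wget ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L001_R1_001.fastq.gz

# Decompress, keeping the originals, so the raw data can be fed to a benchmark
gzip -dk chr1.fa.gz NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
```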
I don't think the main benchmark is a good place for genome data; it's a bit too specialized. I know there is a lot of data, but it comes from a comparatively small number of users, and I'd like to keep the main benchmark as general-purpose as possible. Eventually I even plan to replace the existing data with something more appropriate. Also, please correct me if I'm wrong, but I don't think anyone really cares about performance on some of the less powerful SBCs for this type of data… Adding new data to this benchmark is very expensive (it already takes about 2 weeks to run on the slower machines).

That said, it could be interesting to create a secondary benchmark (using the same code) for this data, which would be run on only a subset of machines. We already have the unstable benchmark; adding another one for genome data (or perhaps medical data in general?) wouldn't be difficult.

The benchmark code itself (i.e., this repository) is pretty agnostic about the data you feed it; you can easily pass whatever file you want. It was designed this way so people could easily test different codecs with their own data, which seems to fit this use pretty well. It wouldn't be difficult to add a Makefile target for genome/medical data, though you can always just invoke the benchmark directly on whatever files you want (see the sketch at the end of this comment).

The real issue is the squash-benchmark-web part. Right now the datasets are hard-coded, but it would be pretty easy to move those into a configuration file (or three: one for the regular benchmark, one for unstable, one for genome/medical). I don't suppose anyone is willing to do a bit of frontend (JavaScript) work on this? I'm willing, but I'm not sure when I'll have time…
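To make the "pass whatever file you want" point concrete, a direct invocation might look like this. It's an untested sketch: the `-o` output flag and the multi-file argument list are borrowed from the Makefile recipe later in this thread, and the file names are just the examples from above:

```sh
# Benchmark arbitrary files with the codecs available to the harness;
# -o names the CSV file to write results into
./benchmark -o my-results.csv chr1.fa NIST7035_TAAGGCGA_L001_R1_001.fastq
```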
All your points sound reasonable.
Okay, so how about starting with something like this?

```diff
diff --git a/.gitignore b/.gitignore
index 3631bbd..29a2f38 100644
--- a/.gitignore
+++ b/.gitignore
@@ -32,3 +32,5 @@
/xargs.1
/xml
/x-ray
+
+/chr*.fa
diff --git a/Makefile b/Makefile
index 3907cd7..775a346 100644
--- a/Makefile
+++ b/Makefile
@@ -36,6 +36,40 @@ SNAPPY = \
paper-100k.pdf \
urls.10K

+GENOME = \
+ chr1.fa \
+ chr2.fa \
+ chr3.fa \
+ chr4.fa \
+ chr5.fa \
+ chr6.fa \
+ chr7.fa \
+ chr8.fa \
+ chr9.fa \
+ chr10.fa \
+ chr11.fa \
+ chr12.fa \
+ chr13.fa \
+ chr14.fa \
+ chr15.fa \
+ chr16.fa \
+ chr17.fa \
+ chr18.fa \
+ chr19.fa \
+ chr20.fa \
+ chr21.fa \
+ chr22.fa
+
+chr%.fa.gz:
+ wget "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/$@"
+
+chr%.fa: chr%.fa.gz
+ squash -kdc gzip $^ $@
+
+genome.csv: $(GENOME)
+ @if [ -e /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor -a "`cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`" != "performance" ]; then echo -e "WARNING: You should switch to the 'performance' CPU governor by running\n\n\tsu -c 'echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'\n"; fi
+ ./benchmark -o $@ $(sort $(GENOME)) 2>&1 | tee result.log
+
DATA = \
$(CANTERBURY) \
 $(SILESA) \
```

That will let you just run something like `make genome.csv`.
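Spelled out, an end-to-end run would be something like the following (untested sketch; the governor step just repeats the warning baked into the genome.csv recipe, and needs root):

```sh
# Optional but recommended: pin the CPU governor to 'performance' first,
# as the recipe's warning suggests (run as root)
su -c 'echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'

# Download, decompress, and benchmark all 22 autosomes in one step
make genome.csv
```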
I just pushed a "remote-data" branch, which basically does this for all datasets. Unless there are objections I'll probably go ahead and push it to master soon, but I need to know exactly which files to use for the genome data; I'm not sure what is useful and what isn't…
I think it would be sufficient to just look at the chr1.fa file, which is the first chromosome and is roughly 8% of the whole genome (about 249 Mbp of roughly 3.1 Gbp). It should be fairly representative without overly burdening the runtime of the benchmark.
I tweaked the remote-data branch so you can benchmark any single piece of data by calling make with the file name plus a csv extension; e.g., for chr1.fa, you can run the target shown in the sketch below. Unfortunately, some of the decompressors fail right now; it's the ones with a buffer-to-buffer API which only accept an…
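In other words, a single-file run looks roughly like this (a sketch; the chr1.fa.csv name just follows the file-name-plus-csv convention described above):

```sh
# The pattern rule downloads and decompresses chr1.fa if necessary,
# then benchmarks it and writes the results to chr1.fa.csv
make chr1.fa.csv
```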
Thanks, I've got the benchmark running. I already discovered some surprises: zstd at compression level 22 does very well both in terms of ratio and decompression speed, outperforming brotli. This also suggests that you might want to include multiple zstd compression levels on the main benchmark page.
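For anyone who wants to reproduce that outside the harness, the standalone zstd CLI (not squash) can be used as a quick check; note that levels 20-22 require --ultra:

```sh
# Compress chr1.fa at level 22; -k keeps the input file
zstd --ultra -22 -k chr1.fa -o chr1.fa.zst

# Time decompression to stdout, discarding the output
time zstd -dc chr1.fa.zst > /dev/null
```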
chr1 benchmark data is here: https://gist.github.com/kyleabeauchamp/e5f5d79aa153bc85d854a705a25c9166
The human reference genome (FASTA filetype) and next-generation sequencing datasets (FASTQ filetype) might provide interesting additions to the benchmark. Currently the state of the art is zlib (e.g., http://www.htslib.org/benchmarks/zlib.html).
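As a rough way to reproduce that baseline locally, gzip can stand in for zlib, since it emits the same deflate stream; this is a sketch, and gzip's default level 6 is only an approximation of the zlib settings the linked page benchmarks:

```sh
# Approximate the zlib baseline on the FASTQ data; -k keeps the input file
gzip -k -6 NIST7035_TAAGGCGA_L001_R1_001.fastq
```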
Furthermore, these folks might be possible consumers of the Squash API in their underlying C library, HTSLib (https://github.com/samtools/htslib).