Add human genome data? #35

Open
kyleabeauchamp opened this issue Apr 1, 2017 · 12 comments

Comments

@kyleabeauchamp

The human reference genome (FASTA filetype) and next-generation sequencing datasets (FASTQ filetype) might provide interesting additions to the benchmark. Currently the state of the art is zlib (e.g., http://www.htslib.org/benchmarks/zlib.html).

Furthermore, these folks could be a consumer of the Squash API via their underlying C library, HTSlib (https://github.com/samtools/htslib).
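
As a rough illustration of the kind of comparison this data invites, here is a minimal sketch assuming the Squash command-line tool is installed with gzip and zstd plugins; genome.fa is a placeholder for any FASTA file, and the exact CLI flags are assumptions:

# Compress a FASTA file with the zlib-based gzip codec and with zstd via the
# Squash CLI (-c selects the codec, -k keeps the input), then compare sizes.
squash -kc gzip genome.fa genome.fa.gz
squash -kc zstd genome.fa genome.fa.zst
ls -l genome.fa genome.fa.gz genome.fa.zst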

@kyleabeauchamp
Author

I will need to check into which files are most appropriate for inclusion based on size and license.

@travisdowns

I think it's a good idea! I have also downloaded genome data because it's an interesting input for compressors, and it seems like a very practical target since there is a huge amount of this data in the wild.

@kyleabeauchamp
Author

Here are the links to human genome chromosome 1 and a typical next-generation sequencing dataset:

http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr1.fa.gz

ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
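
For convenience, a minimal fetch-and-decompress sketch for both files (assuming wget and a gzip recent enough for gunzip -k are available):

# Download chromosome 1 of the hg38 reference (FASTA) and one exome run (FASTQ),
# then decompress them while keeping the original .gz files.
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr1.fa.gz
wget ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
gunzip -k chr1.fa.gz NIST7035_TAAGGCGA_L001_R1_001.fastq.gz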

@kyleabeauchamp
Author

The chr1.fa.gz file seems appropriately licensed: "All the files in this directory are freely available for public use."

http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/

@nemequ
Member

nemequ commented Apr 5, 2017

I don't think the main benchmark is a good place for genome data; it's a bit too specialized. I know there is a lot of data but it's from a comparatively small number of users, and I'd like to try to keep the main benchmark as general-purpose as possible. Eventually I even plan to replace the existing data with something more appropriate.

Also, please correct me if I'm wrong, but I don't think anyone really cares about performance on some of the less powerful SBCs for this type of data… Adding new data to this benchmark is very expensive (it already takes about 2 weeks to run on the slower machines).

That said, it could be interesting to create a secondary benchmark (using the same code) for this data, which would be run on only a subset of machines. We already have the unstable benchmark, adding another one for genome data (or perhaps medical data in general?) wouldn't be difficult.

The benchmark code itself (i.e., this repository) is pretty agnostic about the data you feed it; you can easily pass whatever file you want. It was designed this way so people could easily test different codecs with their own data, which seems to fit this use case pretty well. It wouldn't be difficult to add a Makefile target for genome/medical data (though you can always just do something like ./benchmark -o data.csv chr1.fa manually).
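
For example, with the two files linked above downloaded and decompressed into the working directory, a manual run might look like the sketch below; only the -o flag and positional input files are used, and the decompressed FASTQ filename is an assumption:

# Benchmark both genome files and write the results to a CSV.
./benchmark -o genome.csv chr1.fa NIST7035_TAAGGCGA_L001_R1_001.fastq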

The real issue is the squash-benchmark-web part. Right now the datasets are hard-coded, but it would be pretty easy to move those into a configuration file (or three; one for the regular benchmark, one for unstable, one for genome/medical). I don't suppose anyone is willing to do a bit of frontend (JavaScript) work on this? I'm willing, but I'm not sure when I'll have time…

@kyleabeauchamp
Author

All your points sound reasonable.

@nemequ
Member

nemequ commented Apr 5, 2017

Okay, so how about starting with something like

diff --git a/.gitignore b/.gitignore
index 3631bbd..29a2f38 100644
--- a/.gitignore
+++ b/.gitignore
@@ -32,3 +32,5 @@
 /xargs.1
 /xml
 /x-ray
+
+/chr*.fa
diff --git a/Makefile b/Makefile
index 3907cd7..775a346 100644
--- a/Makefile
+++ b/Makefile
@@ -36,6 +36,40 @@ SNAPPY = \
 	paper-100k.pdf \
 	urls.10K
 
+GENOME = \
+	chr1.fa \
+	chr2.fa \
+	chr3.fa \
+	chr4.fa \
+	chr5.fa \
+	chr6.fa \
+	chr7.fa \
+	chr8.fa \
+	chr9.fa \
+	chr10.fa \
+	chr11.fa \
+	chr12.fa \
+	chr13.fa \
+	chr14.fa \
+	chr15.fa \
+	chr16.fa \
+	chr17.fa \
+	chr18.fa \
+	chr19.fa \
+	chr20.fa \
+	chr21.fa \
+	chr22.fa
+
+chr%.fa.gz:
+	wget "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/$@"
+
+chr%.fa: chr%.fa.gz
+	squash -kdc gzip $^ $@
+
+genome.csv: $(GENOME)
+	@if [ -e /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor -a "`cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`" != "performance" ]; then echo -e "WARNING: You should switch to the 'performance' CPU governor by running\n\n\tsu -c 'echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'\n"; fi
+	./benchmark -o $@ $(sort $(GENOME)) 2>&1 | tee result.log
+
 DATA = \
 	$(CANTERBURY) \
 	$(SILESA) \

That will let you just do something like make genome.csv.
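
The pattern rules also make it easy to benchmark just a subset by hand; for instance, something like this sketch (the choice of chromosomes and the output filename are arbitrary):

# Fetch and decompress two chromosomes via the new chr%.fa rules, then
# benchmark only those files into a separate CSV.
make chr1.fa chr2.fa
./benchmark -o chr1-2.csv chr1.fa chr2.fa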

@nemequ
Member

nemequ commented Apr 5, 2017

I just pushed a "remote-data" branch, which basically does this for all datasets. Unless there are objections I'll probably go ahead and push it to master soon, but I need to know exactly which files to use for the genome data; I'm not sure what is useful and what isn't…

@kyleabeauchamp
Author

I think it would be sufficient to just look at the chr1.fa file, which is the first chromosome and is approximately 5% of the whole genome. It should be fairly representative without overly burdening the runtime of the benchmark.

@nemequ
Member

nemequ commented Apr 19, 2017

I tweaked the remote-data branch so you can benchmark any single piece of data by calling make with the file name plus a .csv extension; e.g., for chr1.fa, you can run make chr1.fa.csv.
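
For example (a sketch of the intended usage on the remote-data branch):

# Fetch and decompress chr1.fa if needed, benchmark it, and write chr1.fa.csv.
make chr1.fa.csv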

Unfortunately some of the decompressors fail right now; it's the ones with a buffer-to-buffer API which only accept an int for sizes (instead of size_t, or int64_t, long long, etc.), and don't encode the uncompressed size in the data. I think I'm going to have to add a max_length field to SquashCodec so Squash won't try to provide larger buffers.

@kyleabeauchamp
Author

Thanks, I've got the benchmark running. I already discovered some surprises: zstd at compression level 22 does very well in terms of both ratio and decompression speed, outperforming brotli.

This also suggests that you might want to include multiple zstd compression levels on the main benchmark page.
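
For anyone who wants to sanity-check the comparison outside of Squash, here is a rough sketch using the standalone zstd and brotli command-line tools (both assumed to be installed; brotli -q 11 stands in for brotli's highest setting):

# Compress chr1.fa at zstd level 22 (levels above 19 need --ultra) and at
# brotli quality 11, compare the resulting sizes, then time decompression.
zstd --ultra -22 -k -f chr1.fa -o chr1.fa.zst
brotli -q 11 -f -o chr1.fa.br chr1.fa
ls -l chr1.fa chr1.fa.zst chr1.fa.br
time zstd -dc chr1.fa.zst > /dev/null
time brotli -dc chr1.fa.br > /dev/null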
