Add human genome data? #35

Open
kyleabeauchamp opened this issue Apr 1, 2017 · 12 comments

Comments

@kyleabeauchamp

The human reference genome (FASTA filetype) and next-generation sequencing datasets (FASTQ filetype) might provide interesting additions to the benchmark. Currently the state of the art is zlib (e.g., http://www.htslib.org/benchmarks/zlib.html).

Furthermore, these folks could be a consumer of the Squash API via their underlying C library, HTSlib (https://github.com/samtools/htslib).
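
As a rough illustration of the kind of comparison this data invites, here is a minimal sketch assuming the Squash command-line tool is installed with gzip and zstd plugins; genome.fa is a placeholder for any FASTA file, and the exact CLI flags are assumptions:

# Compress a FASTA file with the zlib-based gzip codec and with zstd via the
# Squash CLI (-c selects the codec, -k keeps the input), then compare sizes.
squash -kc gzip genome.fa genome.fa.gz
squash -kc zstd genome.fa genome.fa.zst
ls -l genome.fa genome.fa.gz genome.fa.zst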

@kyleabeauchamp
Author

I will need to check into which files are most appropriate for inclusion based on size and license.

@travisdowns

I think it's a good idea! I have also downloaded genome data because it's an interesting input for compressors, and it seems like a very practical target since there is a huge amount of this data in the wild.

@kyleabeauchamp
Author

Here are the links to human genome chromosome 1 and a typical next-generation sequencing dataset:

http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr1.fa.gz

ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
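
For convenience, a minimal fetch-and-decompress sketch for both files (assuming wget and a gzip recent enough for gunzip -k are available):

# Download chromosome 1 of the hg38 reference (FASTA) and one exome run (FASTQ),
# then decompress them while keeping the original .gz files.
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr1.fa.gz
wget ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
gunzip -k chr1.fa.gz NIST7035_TAAGGCGA_L001_R1_001.fastq.gz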

@kyleabeauchamp
Author

The chr1.fa.gz file seems appropriately licensed: "All the files in this directory are freely available for public use."

http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/

@nemequ
Member

nemequ commented Apr 5, 2017

I don't think the main benchmark is a good place for genome data; it's a bit too specialized. I know there is a lot of data but it's from a comparatively small number of users, and I'd like to try to keep the main benchmark as general-purpose as possible. Eventually I even plan to replace the existing data with something more appropriate.

Also, please correct me if I'm wrong, but I don't think anyone really cares about performance on some of the less powerful SBCs for this type of data… Adding new data to this benchmark is very expensive (it already takes about 2 weeks to run on the slower machines).

That said, it could be interesting to create a secondary benchmark (using the same code) for this data, which would be run on only a subset of machines. We already have the unstable benchmark, adding another one for genome data (or perhaps medical data in general?) wouldn't be difficult.

The benchmark code itself (i.e., this repository) is pretty agnostic about the data you feed it; you can easily pass whatever file you want. It was designed this way so people could easily test different codecs with their own data, which seems to fit this use case pretty well. It wouldn't be difficult to add a Makefile target for genome/medical data (though you can always just do something like ./benchmark -o data.csv chr1.fa manually).
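
For example, with the two files linked above downloaded and decompressed into the working directory, a manual run might look like the sketch below; only the -o flag and positional input files are used, and the decompressed FASTQ filename is an assumption:

# Benchmark both genome files and write the results to a CSV.
./benchmark -o genome.csv chr1.fa NIST7035_TAAGGCGA_L001_R1_001.fastq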

The real issue is the squash-benchmark-web part. Right now the datasets are hard-coded, but it would be pretty easy to move those into a configuration file (or three; one for the regular benchmark, one for unstable, one for genome/medical). I don't suppose anyone is willing to do a bit of frontend (JavaScript) work on this? I'm willing, but I'm not sure when I'll have time…

@kyleabeauchamp
Author

All your points sound reasonable.

@nemequ
Member

nemequ commented Apr 5, 2017

Okay, so how about starting with something like

diff --git a/.gitignore b/.gitignore
index 3631bbd..29a2f38 100644
--- a/.gitignore
+++ b/.gitignore
@@ -32,3 +32,5 @@
 /xargs.1
 /xml
 /x-ray
+
+/chr*.fa
diff --git a/Makefile b/Makefile
index 3907cd7..775a346 100644
--- a/Makefile
+++ b/Makefile
@@ -36,6 +36,40 @@ SNAPPY = \
 	paper-100k.pdf \
 	urls.10K
 
+GENOME = \
+	chr1.fa \
+	chr2.fa \
+	chr3.fa \
+	chr4.fa \
+	chr5.fa \
+	chr6.fa \
+	chr7.fa \
+	chr8.fa \
+	chr9.fa \
+	chr10.fa \
+	chr11.fa \
+	chr12.fa \
+	chr13.fa \
+	chr14.fa \
+	chr15.fa \
+	chr16.fa \
+	chr17.fa \
+	chr18.fa \
+	chr19.fa \
+	chr20.fa \
+	chr21.fa \
+	chr22.fa
+
+chr%.fa.gz:
+	wget "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/$@"
+
+chr%.fa: chr%.fa.gz
+	squash -kdc gzip $^ $@
+
+genome.csv: $(GENOME)
+	@if [ -e /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor -a "`cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`" != "performance" ]; then echo -e "WARNING: You should switch to the 'performance' CPU governor by running\n\n\tsu -c 'echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'\n"; fi
+	./benchmark -o $@ $(sort $(GENOME)) 2>&1 | tee result.log
+
 DATA = \
 	$(CANTERBURY) \
 	$(SILESA) \

That will let you just do something like make genome.csv.
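
The pattern rules also make it easy to benchmark just a subset by hand; for instance, something like this sketch (the choice of chromosomes and the output filename are arbitrary):

# Fetch and decompress two chromosomes via the new chr%.fa rules, then
# benchmark only those files into a separate CSV.
make chr1.fa chr2.fa
./benchmark -o chr1-2.csv chr1.fa chr2.fa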

@nemequ
Member

nemequ commented Apr 5, 2017

I just pushed a "remote-data" branch, which basically does this for all datasets. Unless there are objections I'll probably go ahead and push it to master soon, but I need to know exactly which files to use for the genome data; I'm not sure what is useful and what isn't…

@kyleabeauchamp
Author

I think it would be sufficient to just look at the chr1.fa file, which is the first chromosome and is approximately 5% of the whole genome. It should be fairly representative without overly burdening the runtime of the benchmark.

@nemequ
Member

nemequ commented Apr 19, 2017

I tweaked the remote-data branch so you can benchmark any single piece of data by calling make with the file name plus a .csv extension; e.g., for chr1.fa, you can run make chr1.fa.csv.
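
For example (a sketch of the intended usage on the remote-data branch):

# Fetch and decompress chr1.fa if needed, benchmark it, and write chr1.fa.csv.
make chr1.fa.csv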

Unfortunately some of the decompressors fail right now; it's the ones with a buffer-to-buffer API which only accept an int for sizes (instead of size_t, or int64_t, long long, etc.), and don't encode the uncompressed size in the data. I think I'm going to have to add a max_length field to SquashCodec so Squash won't try to provide larger buffers.

@kyleabeauchamp
Author

Thanks, I've got the benchmark running. I already discovered some surprises: zstd at compression level 22 does very well in terms of both ratio and decompression speed, outperforming brotli.

This also suggests that you might want to include multiple zstd compression levels on the main benchmark page.
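
For anyone who wants to sanity-check the comparison outside of Squash, here is a rough sketch using the standalone zstd and brotli command-line tools (both assumed to be installed; brotli -q 11 stands in for brotli's highest setting):

# Compress chr1.fa at zstd level 22 (levels above 19 need --ultra) and at
# brotli quality 11, compare the resulting sizes, then time decompression.
zstd --ultra -22 -k -f chr1.fa -o chr1.fa.zst
brotli -q 11 -f -o chr1.fa.br chr1.fa
ls -l chr1.fa chr1.fa.zst chr1.fa.br
time zstd -dc chr1.fa.zst > /dev/null
time brotli -dc chr1.fa.br > /dev/null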
