Indicate maturity of implementations? #66
@jeromekelleher sorry to hear about these issues. Unfortunately, because the different Zarr implementations are independent entities, I don't think it's easy to gather exhaustive, accurate information about their feature-completeness in one place (e.g., this repo). We could get part of the way there by periodically checking out the source code for those repos, building the code, and running it against some benchmark suite. I think this could be really cool, if someone has the time to set that pipeline up. But even a library that passes these kinds of tests could still have the API issues you are experiencing with jzarr. What's the goal of your benchmark? You might be interested in a lively discussion over in
In particular, it looks like this repo is missing regular test runs, and a place to display the output of the test results, e.g. a docs page.
I agree that having exhaustive feature-completeness scores would be quite a chore, and very hard to keep up to date. What's not obvious from the current list, though, is that most of these implementations are really just proofs of concept, and not actually intended for other people to build real applications on. Even just an "intended for production use" tick would be really helpful and would have saved me a lot of time. The page on the website (https://zarr.dev/implementations/) gives the impression that all these implementations are on the same footing as zarr-python, which ultimately isn't helpful because people might randomly try a few implementations and come to the (false!) impression that the entire Zarr ecosystem is half-baked.
I'm writing a paper about sgkit, which is (essentially) trying to bring the pydata ecosystem to the analysis of genetic variation data. We use Zarr to store the data, which I would like to emphasise is independent of sgkit and Python, as I feel that Zarr provides practical and pragmatic solutions to fundamental problems that large-scale genomics is currently struggling with. To make this point, I want to do a simple benchmark which essentially just reads through a terabyte-scale dataset, doing some very simple calculations on it. The people I most want to reach here tend to be a little Python-sceptical, so I would like to do the benchmark in a language that is not Python.
Perhaps a simpler solution would be to add the current version next to the name, with the assumption that implementations below v1.0.0 are less mature. The table could be ordered according to version as well.
Even if this suggested change were made, given the rate of activity on this repo (the last commit was 2 years ago) it would quickly become out of date. I think the bigger problem here is a demographic one -- nobody is actually working to keep this information up to date.
I was referring to the table at https://github.com/zarr-developers/zarr-developers.github.io/blob/main/implementations/index.md, though I agree it would become out of date if maintained manually. Perhaps a GitHub Actions workflow could be developed.
A pre-v1 vs post-v1 distinction could also work and be less prone to becoming out of date.
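To make the pre-v1 vs post-v1 idea concrete, here is a minimal sketch of how such a label could be derived from a version string. The implementation names and version numbers below are hypothetical placeholders, not data about any real Zarr implementation, and the helper names (`is_post_v1`, `maturity_label`) are invented for illustration.

```python
import re


def is_post_v1(version: str) -> bool:
    """Return True if the leading major version number is 1 or higher."""
    match = re.match(r"v?(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)) >= 1


def maturity_label(version: str) -> str:
    """Map a version string to the coarse label suggested in this thread."""
    return "post-v1" if is_post_v1(version) else "pre-v1"


# Hypothetical names and versions, for illustration only.
implementations = {"impl-a": "0.3.2", "impl-b": "v2.16.1", "impl-c": "1.0.0"}

# Order the table with post-v1 entries first, then alphabetically by name.
table = sorted(
    implementations.items(),
    key=lambda item: (not is_post_v1(item[1]), item[0]),
)

for name, version in table:
    print(f"{name:8} {version:10} {maturity_label(version)}")
```

A coarse label like this only changes when an implementation crosses the 1.0 boundary, which is why it would go stale less often than an exact version number. It is still only a proxy: a version number says nothing definite about whether the maintainers intend the library for production use.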
I've spent the last few days going through the various Zarr implementations trying to create a simple read-oriented benchmark, and have had a pretty frustrating experience. Most of the implementations seem to be in a pretty early proof-of-concept phase, and I think it would be helpful to indicate how feature-complete implementations are, and whether they have usable documentation etc.
I've almost got a Java implementation going based on JZarr, but it seems to lack any form of support for reading in an efficient chunk-aware manner, and the API support for getting at the ND array values is pretty limited.
(This is probably not the forum, but some advice on the best way to make such a benchmark without using zarr-python, or advice on where I might ask for such advice, would be much appreciated!)
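For readers unsure what "chunk-aware" reading means here, the following is a rough sketch of the access pattern, written in Python only for brevity. It uses no Zarr library at all: `chunk_slices` is a hypothetical helper that, given an array shape and a chunk shape, yields one selection per chunk so that each stored chunk is touched exactly once. The same loop structure should carry over to other languages.

```python
from itertools import product


def chunk_slices(shape, chunks):
    """Yield one tuple of slices per chunk, together covering `shape`."""
    if len(shape) != len(chunks):
        raise ValueError("shape and chunks must have the same number of dimensions")
    # Start offsets of the chunks along each dimension.
    starts_per_dim = [range(0, dim, step) for dim, step in zip(shape, chunks)]
    for starts in product(*starts_per_dim):
        # Clip the final chunk in each dimension to the array boundary.
        yield tuple(
            slice(start, min(start + step, dim))
            for start, step, dim in zip(starts, chunks, shape)
        )


# A 5 x 4 array stored in 2 x 4 chunks has three chunks along the first axis.
for selection in chunk_slices((5, 4), (2, 4)):
    print(selection)
```

In a benchmark, each yielded selection would be used to index the array (for example `array[selection]` with an array-like object), so that every chunk is decompressed once rather than once per row or element read.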