Parquet skeletons #134
This is possibly better suited to a Discussion than an issue, but I'll continue my roaming train of thought... Returning to the arrow point above, defining the spec in terms of an arrow schema ("neurarrow"?) would give us the feather/IPC format for free. The implementations of parquet I've seen tend to read it into arrow format anyway.
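To make the "feather for free" point concrete, here's a minimal sketch of writing the same arrow table to both IPC/feather and parquet. The field names and types are illustrative assumptions, not a settled spec:

```python
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

# Illustrative schema only -- column names/types are assumptions, not a spec.
schema = pa.schema([
    pa.field("node_id", pa.uint64(), nullable=False),
    pa.field("parent_id", pa.uint64(), nullable=True),  # null = root node
    pa.field("x", pa.float64()),
    pa.field("y", pa.float64()),
    pa.field("z", pa.float64()),
    pa.field("radius", pa.float64()),
])

table = pa.table(
    {
        "node_id": [1, 2, 3],
        "parent_id": [None, 1, 2],
        "x": [0.0, 1.0, 2.0],
        "y": [0.0, 0.5, 1.0],
        "z": [0.0, 0.0, 0.0],
        "radius": [1.0, 0.8, 0.6],
    },
    schema=schema,
)

# One in-memory representation, two on-disk formats:
feather.write_feather(table, "skeleton.feather")  # arrow IPC
pq.write_table(table, "skeleton.parquet")
```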
I can't speak for others, but I would be interested in making and using a generic, modern skeleton format that lives in something like arrow, particularly if it can be more expressive than an SWC in terms of things like synapses and other annotations (and optionally meshes, if I can ask for extra nice things). I've implemented an h5 format that does things like this within our tooling and it works perfectly well, but it is special-purpose and lacks some features I've come to think I want. This is a scenario where it would make a lot more sense for us connectomics folks to put in the extra 20-50% effort: instead of building yet another special-purpose format, let's try to make something more general-purpose, with a core library of the main functions on which we can build our more opinionated, dataset-specific features. Arrow is definitely the right kind of technology for this, between the fast IO and the new pandas implementations. (Also, …
This may be a good reference: geoparquet (and more generally geoarrow).
I made an attempt: https://github.com/clbarnes/neurarrow. I would have liked to use structs for the coordinates (they would still be stored in a column-oriented way, i.e. all the Xs contiguous), but pandas doesn't support structs yet (support is in beta). The format makes some use of list and map types too (n.b. in arrow, dictionary means a categorical type, and map means key-value pairs like python's dicts). I added some required and suggested metadata, a way of specifying optional and derived columns, and so on - a rough sketch of the idea is below.
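For a flavour of that extra expressiveness, here's a rough pyarrow sketch. The column names, types, and metadata keys here are my own illustrative assumptions, not the actual neurarrow draft spec (see the repo for that):

```python
import pyarrow as pa

# All names/keys below are hypothetical, not the neurarrow spec itself.
schema = pa.schema(
    [
        pa.field("node_id", pa.uint64(), nullable=False),
        pa.field("parent_id", pa.uint64(), nullable=True),
        pa.field("x", pa.float32()),
        pa.field("y", pa.float32()),
        pa.field("z", pa.float32()),
        # list type: e.g. IDs of synapses attached to each node
        pa.field("synapse_ids", pa.list_(pa.uint64())),
        # map type (key-value, like python's dict): e.g. free-form annotations
        pa.field("annotations", pa.map_(pa.string(), pa.string())),
    ],
    # file-level metadata travels with the schema
    metadata={"units": "nm", "id": "example-skeleton"},
)
```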
I had just been thinking about using feather (arrow IPC) or parquet for skeleton storage, and as is often the case, you've thought about it first.

Something I'd considered would be to use a null rather than -1 for the `parent_id` of root nodes. This means we're not wasting a bit to encode a single sentinel value per skeleton, and we can map node IDs onto the whole uint64 space (not that we're likely to run out of IDs, but they're not necessarily counting up from 0). AFAIK, null counts are encoded in the parquet metadata, so retrieving root nodes could be much faster than scanning through the whole file. If we could pin the pandas version to >=2, we could use the arrow backend and switch navis generally onto using a nullable column, but until then it's a fairly simple switch to make at the IO stage. N.B. using the arrow backend would, I think, allow an extremely fast IO mode where the memory buffer could either be dumped straight to a file or read directly by other libraries as feather format (NBLAST memory sharing?).

For bundling several parquets together I'd consider tar rather than zip (tar-quet??): parquet files are probably best compressed using internal codecs, which make the best use of the file's structure, and any compression layered over the top of that will slow IO without significant space savings. Plain tarballs don't have that overhead. A sketch of both ideas is below.
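A minimal sketch of the null-`parent_id` and tar-bundling ideas, assuming pyarrow (file names are placeholders). Root lookup leans on the per-row-group null counts that parquet stores in its footer statistics, and the bundle uses a plain uncompressed tar so the internal zstd codec does all the compression:

```python
import tarfile

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# null (not -1) marks the root, so node IDs can use the full uint64 range
table = pa.table({
    "node_id": pa.array([1, 2, 3], type=pa.uint64()),
    "parent_id": pa.array([None, 1, 2], type=pa.uint64()),
})
pq.write_table(table, "skel.parquet", compression="zstd")  # internal codec

# roots are simply the rows whose parent_id is null
roots = table.filter(pc.is_null(table["parent_id"]))

# parquet stores per-row-group null counts in its footer statistics, so a
# reader can skip row groups containing no roots without touching data pages
rg0 = pq.ParquetFile("skel.parquet").metadata.row_group(0)
print(rg0.column(1).statistics.null_count)  # column 1 is parent_id

# bundle several (already internally compressed) parquets in a plain tar:
# no second compression layer to slow IO for negligible space savings
with tarfile.open("skeletons.tar", "w") as tf:
    tf.add("skel.parquet")
```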