JSON support in Arrow and DataFusion #9103
Replies: 4 comments
-
I am not aware of anything actively
I think there are several contributors who would be interested in helping. Interestingly, there is a discussion of a similar topic here: #7845. I suggest we move the conversation there.
-
Demand is high. I work in healthcare, and there is a lot of data in deeply nested JSON format. Today I have to depend on query engine capabilities to deal with struct and map functions. They are often slow and don't vectorize the processing, and they often end up OOM when processing large data. In general, JSON usage has increased these days: people are dropping relational modeling in favor of JSON. In particular, databases like PostgreSQL offer JSON columns, so anyone can combine relational concepts with JSON very easily. Having native support in Arrow with vectorized JSON processing would help tremendously.
-
👋 ClickHouse recently implemented something similar: https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse -- although they didn't mention JSON Tiles anywhere in the blog post, so it might not be exactly the same. PR: ClickHouse/ClickHouse#66444. Maybe it can be used as an example implementation.
-
There is also a proposal to add a VARIANT type (#10987, apache/arrow-rs#6736). VARIANT is conceptually similar to JSON with slight modifications:
(I am referring to Spark's VARIANT implementation, https://github.com/apache/spark/tree/master/common/variant, because this one is likely to be adopted by Iceberg too and become the de facto standard.) Thus, if we add proper VARIANT support, do we also need to add JSON support with JSON tiles, or should we direct users towards VARIANT to get all the performance benefits? Obviously, even with excellent VARIANT support, direct and explicit support for JSON remains very useful.
-
Two questions for this community:
I came across https://www.durner.dev/app/media/papers/json-tiles-sigmod21.pdf, which describes techniques used by the Umbra RDBMS for its JSON support. Their experiments suggest excellent performance for analytic computations over semi-structured data. There is not yet an open source implementation AFAICT, but the paper is detailed enough that an implementation may be possible based on the information within.
I'd describe what I'm picturing as: First, specify a serialization format for JSON tiles. Then write the compute kernel functions to operate on serialized JSON tiles. I imagine it might make sense to designate this as a Binary-based Arrow extension type, too. (I might also ask on the Arrow mailing list; will link here if I do so and it's possible to link.)
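To make the two steps above a bit more concrete, here is a very rough Python sketch of how I read the tile idea: within one tile (a batch of documents), keys that appear often enough are pulled out into plain columns, and whatever is left stays as serialized JSON bytes. Everything here is my own assumption for illustration: the names `build_tile` and `column_sum`, the `min_frequency` threshold, the top-level-keys-only simplification, and the dict layout are hypothetical and are not the paper's format or any Arrow API.

```python
import json
from collections import Counter


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def build_tile(docs, min_frequency=0.6):
    """Split one tile of JSON object strings into columns plus a binary remainder.

    Hypothetical layout: only top-level keys are considered, and a key is
    extracted when it appears in at least `min_frequency` of the documents.
    """
    parsed = [json.loads(d) for d in docs]
    # Count in how many documents each top-level key appears.
    key_counts = Counter(key for doc in parsed for key in doc)
    frequent = sorted(
        key for key, n in key_counts.items() if n / len(parsed) >= min_frequency
    )
    # Frequent keys become columns; None marks a missing value
    # (this toy version does not distinguish missing from JSON null).
    columns = {key: [doc.get(key) for doc in parsed] for key in frequent}
    # Infrequent keys stay behind as serialized JSON bytes, one blob per document.
    remainder = [
        json.dumps(
            {k: v for k, v in doc.items() if k not in columns}, sort_keys=True
        ).encode("utf-8")
        for doc in parsed
    ]
    return {"columns": columns, "remainder": remainder}


def column_sum(tile, key):
    """Toy 'kernel': sum a numeric key, preferring the extracted column."""
    if key in tile["columns"]:
        # Fast path: the key was extracted, so no JSON parsing is needed.
        return sum(v for v in tile["columns"][key] if _is_number(v))
    # Slow path: fall back to parsing the per-document remainder.
    total = 0
    for raw in tile["remainder"]:
        value = json.loads(raw).get(key)
        if _is_number(value):
            total += value
    return total


docs = [
    '{"id": 1, "amount": 10, "note": "a"}',
    '{"id": 2, "amount": 5}',
    '{"id": 3, "extra": 7}',
]
tile = build_tile(docs)
print(sorted(tile["columns"]))     # keys that were extracted into columns
print(column_sum(tile, "amount"))  # served from the extracted column
print(column_sum(tile, "extra"))   # falls back to the binary remainder
```

In a real implementation I'd expect the columns to be Arrow arrays and the remainder to be the Binary storage of the extension type, but that mapping is a guess on my part.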