RFC: Secondary parquet index for IDs #233
Unanswered
brad-richardson
asked this question in
GERS
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Context
There are a number of use cases that could be improved with a "secondary index" to locate Overture features by ID. Exploring, visualizing and data tools all benefit from having a faster way to locate a specific feature. GERS IDs are a core feature in Overture and we should support that with additional infrastructure where needed. This proposal is in addition to the GeoParquet bbox index proposal, with similar query speed-up goals but different target use cases.
Because we don't sort or bucket on IDs in any way, most queries for an ID will require a full scan across all Overture data. Here is a sample query for a transportation segment ID (likely the last theme and type scanned if scanned lexicographically), which took 8 minutes 34 seconds originating from the US East Coast using duckdb:
Goals
Proposal
Create partitioned parquet files with IDs and minimal additional attributes, partitioned by last two characters of each ID string, which would be located at a path like this:
Suffix was chosen as it is a simple string operation, likely implemented in all clients and is sufficiently distributed for the current set of IDs. See "Alternative partitioning" below for a review of other considered partitioning schemes.
Each parquet file will have at least these columns:
Parquet attributes
Not included
Alternative partitioning
MOD(FROM_BIG_ENDIAN_64(XXHASH64(TO_UTF8(CAST(id AS VARCHAR)))
), 1000). Suffix/substring is a simple string operation and should be supported by all clients (SUBSTR(id, -2)
).prefix=08b
has 2.9B features whileprefix=080
has 3,535 features. Extending the prefix to >3 characters would create an unmanageable # of partitions.In progress/unknowns:
Beta Was this translation helpful? Give feedback.
All reactions