RFC: Secondary parquet index for IDs #233

brad-richardson · 2024-09-23T21:19:48Z

brad-richardson
Sep 23, 2024
Collaborator

Context

There are a number of use cases that could be improved with a "secondary index" to locate Overture features by ID. Exploration, visualization, and data tools all benefit from a faster way to locate a specific feature. GERS IDs are a core feature of Overture, and we should support them with additional infrastructure where needed. This proposal is in addition to the GeoParquet bbox index proposal, with similar query speed-up goals but different target use cases.

Because we don't sort or bucket on IDs in any way, most queries for an ID will require a full scan across all Overture data. Here is a sample query for a transportation segment ID (likely the last theme and type scanned if scanned lexicographically), which took 8 minutes 34 seconds originating from the US East Coast using duckdb:

time duckdb -c "select id, sources from read_parquet('s3://overturemaps-us-west-2/release/2024-09-18.0/theme=*/type=*/*') where id = '0895355d302bffff043fdf09acc2b72d' limit 1;"
100% ▕████████████████████████████████████████████████████████████▏
┌──────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                id                │                                                sources                                                 │
│             varchar              │ struct(property varchar, dataset varchar, record_id varchar, update_time varchar, confidence double)[] │
├──────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 0895355d302bffff043fdf09acc2b72d │ [{'property': , 'dataset': TomTom, 'record_id': NULL, 'update_time': NULL, 'confidence': NULL}]        │
└──────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────┘
duckdb -c   65.87s user 18.33s system 16% cpu 8:33.93 total

Goals

Quickly find a feature from a given ID without a full corpus scan
Support the Explore site with easy lookup of features by ID
No requirement of theme or type for request
Easy to use in a wide range of clients

Proposal

Create partitioned parquet files containing IDs and minimal additional attributes, partitioned by the last two characters of each ID string, located at paths like these:

s3://overturemaps-us-west-2/id_index/2024-09-18.0/suffix=0a/index01.parquet
s3://overturemaps-us-west-2/id_index/2024-09-18.0/suffix=0b/index01.parquet
s3://overturemaps-us-west-2/id_index/2024-09-18.0/suffix=0c/index01.parquet
...

A suffix was chosen because taking it is a simple string operation that is likely implemented in all clients, and because suffixes are sufficiently evenly distributed across the current set of IDs. See "Alternative partitioning" below for a review of the other partitioning schemes considered.
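
As a minimal sketch of the client-side step, assuming the path layout shown above, the partition for an ID could be derived as follows. The helper name `index_partition_path` and the `*.parquet` glob are illustrative assumptions, not part of the proposal; the number of index files per partition has not been decided.

```python
# Hypothetical helper: the function name and the glob pattern are
# illustrative assumptions, not part of the proposal.
INDEX_ROOT = "s3://overturemaps-us-west-2/id_index"


def index_partition_path(feature_id: str, release: str) -> str:
    """Return a glob covering the index files in the partition for feature_id."""
    if len(feature_id) < 2:
        raise ValueError("feature ID must have at least two characters")
    suffix = feature_id[-2:]  # equivalent to SUBSTR(id, -2) in SQL
    return f"{INDEX_ROOT}/{release}/suffix={suffix}/*.parquet"


path = index_partition_path("0895355d302bffff043fdf09acc2b72d", "2024-09-18.0")
```

A client would then query only the files under that path instead of scanning every theme and type.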

Each parquet file will have at least these columns:

id
theme
type
file name
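
To illustrate how these columns could be used together, here is a rough sketch that resolves an index row to the data file holding the feature. It assumes the file name column stores a name relative to the theme/type directory of the release; the `file_name` field name, the `IndexRow` type, and the `part-00000.parquet` value are placeholders, not confirmed details.

```python
from typing import NamedTuple

DATA_ROOT = "s3://overturemaps-us-west-2/release"


class IndexRow(NamedTuple):
    """One row of the proposed index (field names are assumptions)."""
    id: str
    theme: str
    type: str
    file_name: str  # hypothetical column name for "file name"


def data_file_path(row: IndexRow, release: str) -> str:
    """Build the path of the single data file that should contain the feature."""
    return f"{DATA_ROOT}/{release}/theme={row.theme}/type={row.type}/{row.file_name}"


# Placeholder row: the file name is made up for illustration.
row = IndexRow(
    id="0895355d302bffff043fdf09acc2b72d",
    theme="transportation",
    type="segment",
    file_name="part-00000.parquet",
)
target = data_file_path(row, "2024-09-18.0")
```

With the target file known, the second query reads one file rather than the full corpus.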

Parquet attributes

Sorted by id
Not bucketed
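
Sorting by id matters because Parquet stores min/max statistics per row group. When a file is sorted, those ranges are ordered and do not overlap, so a reader can skip every row group except one. The sketch below illustrates that pruning on in-memory statistics; the `RowGroupStats` type and the sample ranges are made up for illustration and do not describe real index files.

```python
from bisect import bisect_left
from typing import NamedTuple, Optional, Sequence


class RowGroupStats(NamedTuple):
    min_id: str
    max_id: str


def find_row_group(stats: Sequence[RowGroupStats], feature_id: str) -> Optional[int]:
    """Return the index of the only row group that can hold feature_id, or None.

    Assumes the file is sorted by id, so row group ranges are ordered and
    non-overlapping.
    """
    # Binary search for the first row group whose max_id is >= feature_id.
    maxes = [s.max_id for s in stats]
    i = bisect_left(maxes, feature_id)
    if i < len(stats) and stats[i].min_id <= feature_id <= stats[i].max_id:
        return i
    return None


# Made-up statistics for three row groups in one index file.
stats = [
    RowGroupStats("0800000000000000000000000000002d", "0855ffffffffffffffffffffffffff2d"),
    RowGroupStats("0856000000000000000000000000002d", "08a0ffffffffffffffffffffffffff2d"),
    RowGroupStats("08a1000000000000000000000000002d", "08ffffffffffffffffffffffffffff2d"),
]
hit = find_row_group(stats, "0895355d302bffff043fdf09acc2b72d")
```

Parquet readers that support predicate pushdown apply the same idea automatically when filtering on id.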

Not included

row index within row group: this doesn't appear to be easily accessible or supported by clients for direct lookup

Alternative partitioning

suffix + theme + type as partitions: The primary expected use is requesting a single ID or set of IDs where theme and type are unknown. Not splitting further into subpartitions reduces the number of file requests needed. Additional subpartitions would also make the index harder to manage and use.
hash + mod: Difficult to implement consistently across multiple clients. DuckDB has a hash function, but its underlying algorithm is unknown. Presto/Trino has md5, xxhash64, and many others, but it is not simple to convert a string ID into something bucket-able with a modulo (e.g. MOD(FROM_BIG_ENDIAN_64(XXHASH64(TO_UTF8(CAST(id AS VARCHAR)))), 1000)). Suffix/substring is a simple string operation and should be supported by all clients (SUBSTR(id, -2)).
prefix: The current distribution of IDs has some non-random structure in the first half of the ID, which creates very skewed partitions. For example, prefix=08b has 2.9B features while prefix=080 has 3,535 features. Extending the prefix beyond 3 characters would create an unmanageable number of partitions.
longer suffix: Too many partitions (2 characters = 256 max partitions, 3 characters = 4096 max).
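
The partition counts quoted above follow if IDs are hexadecimal strings (an assumption based on the sample ID): n characters give at most 16^n partitions. The sketch below shows that arithmetic, plus one simple way to compare prefix and suffix skew on a list of IDs. The sample IDs are short placeholders chosen for illustration, not real GERS IDs.

```python
from collections import Counter
from typing import Callable, Iterable


def max_partitions(num_chars: int, alphabet_size: int = 16) -> int:
    """Upper bound on partitions when keying on num_chars hex characters."""
    return alphabet_size ** num_chars


def partition_sizes(ids: Iterable[str], key: Callable[[str], str]) -> Counter:
    """Count how many IDs fall into each partition for a given key function."""
    return Counter(key(i) for i in ids)


# Placeholder IDs for illustration only, not real GERS IDs.
sample_ids = ["08b1a3", "08b27c", "08b3e9", "08b4f0"]
by_prefix = partition_sizes(sample_ids, lambda i: i[:3])   # all share "08b"
by_suffix = partition_sizes(sample_ids, lambda i: i[-2:])  # spread across keys
```

Running the same counts over a real release would be one way to confirm that suffix partitions stay balanced as the ID set changes.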

In progress/unknowns:

Code to create this index
Example queries in common clients (duckdb, pyarrow, overturemaps-py helper)
Decide on file and row group sizes for index files
Possible additional columns:
- bbox: unsure how common the ID-to-bbox lookup use case is
- row group #: difficult to use outside of low-level clients
