DuckDB processing #23

cholmes · 2023-07-28T03:40:43Z

cholmes
Jul 28, 2023

Just wanted to start a thread for people to share any SQL or scripts they're doing with DuckDB with Overture dataset. I picked the buildings category, since that's where I'm working, but consider this open for all. (would be nice to have more discussion categories, like one for 'tools' or something).

I mostly wanted to share my scripts and running notes on queries

Notes: (the repo name is aspirational - I hope to make tutorials, but mostly used this as scratch to remember stuff when my DuckDB history gets lost.)

geoparquet-overture.md. The admin boundaries are actually the more fleshed out there, the buildings stuff I've tried more to turn into scripts.

Scripts:

overture-buildings-parquet-add-columns.py - added quadkey and country_iso as columns to use for partitioning
overture-buildings-by-country.py - one partition attempt. Breaks up by country, orders by quadkey, and outputs valid geoparquet.

My goal with this is to get to valid geoparquet, partitioned on source.coop, building on experiments at https://beta.source.coop/cholmes/google-open-buildings My hope is that it is then much faster to access by geography and get everything, leveraging duckdb & parquet magic. And then to wrap that in a script that lets people input their geometry and desired output format, and it gets it all for them. Ideally the core data would be distributed like this, but I'll aim to prove it out first.

Please everyone share what you're doing with DuckDB - big and small! It can just be nice SQL queries that others might be interested in.

cholmes · 2023-07-28T03:46:18Z

cholmes
Jul 28, 2023
Author

Note that I've run the first script on 2/3's of the data and am trying to upload to source.coop, but my internet is slow. And hoping to get all the data in one duckdb and do a partition to get on source. But I'm on vacation the next couple weeks, and much of it with slow internet, so doubt I'll get everything up. If anyone else has the ability to run those and upload to the cloud go for it.

3 replies

cholmes Jul 30, 2023
Author

A bulk of the data is now up at https://beta.source.coop/cholmes/overture and I've got a basic readme. The rest I think may make it up over the next few days. I have one hive partition folder for the US, since it was too big to process on its own. I hope to split up the rest of the bigger countries (like over 2 gig), and to make a hive partition for each.

I've yet to test it much at all, if anyone can run some queries against it and let me know how it goes that'd be awesome. Particularly sub-country level. I suspect the order by quadkey likely will help spatial queries, if you use the BBOX struct.

bdon Jul 31, 2023
Maintainer

Link should be https://beta.source.coop/cholmes/overture right? Thanks for sharing!

cholmes Jul 31, 2023
Author

Yes! Fixed - thanks for catching that.

bdon · 2023-07-28T04:27:14Z

bdon
Jul 28, 2023
Maintainer

Here's how I prepare a CSV file of places for input into tippecanoe from an on-disk copy of the dataset:

COPY (select json_extract(bbox,'$.minx') as x,
        json_extract(bbox,'$.miny') as y,
        json_extract_string(names, '$.common[0].value') as name,
        json_extract_string(categories, '$.main') as category_main,
        from read_parquet('overture/theme=places/type=place/*')) TO 'pois.csv' (HEADER, DELIMITER ',');

Since this is only a point dataset I am just reading two values from the bbox as a shortcut, instead of parsing the WKB geometry.
The json_extract_string is necessary to strip quotes DuckDB likes to put on untyped JSON values.
I run this with duckdb -s query.sql to write out an intermediate .csv. It would be much nicer to have DuckDB read from stdin and write to a stdout piped directly into tippecanoe, but I haven't figured out if there's a right combination of flags to do this.

Update: Demo tiled output at https://github.com/bdon/overture-tiles

7 replies

bdon Jul 28, 2023
Maintainer

@PostholerCom can you show us how your GDAL/OGR based translation handles a non-tabular schema like the current Overture dataset?

PostholerCom Jul 28, 2023

ogr2ogr -dialect sqlite -sql @admins.sql -nln over -overwrite admins.gpkg /vsis3/overturemaps-us-west-2/release/2023-07-26-alpha.0

In all fairness it does not handle the names/sources JSON columns....yet. You can view my cheat sheet:

https://www.postholer.com/articles/Overature-Cheat-Sheet

cholmes Jul 28, 2023
Author

I mean, I know GDAL/OGR pretty well, and initiated the funding that built the Parquet and Arrow drivers for it. I came to DuckDB when I was trying to work with v2 of the google buildings, which was maybe 150 gigs of CSV files? GDAL is memory bound, so it would just crap out when I needed to work with the biggest individual files, and when I needed to use the whole dataset at once - to re-partition it. I used PostGIS to do that, and found DuckDB after I did the really hard stuff, but being a columnar store written for multi-core DuckDB is way faster for some stuff. Like a count - with my 800m records in PG it took minutes, with Duck it's sub-second. DuckDB does better than GDAL with really huge datasets, as it'll just start using disk when the memory is maxed out. So that's why I ended up in DuckDB.

More philosophically, I love GDAL, it's the foundation of everything in our world. But I don't think that means we should try to convince everyone to 'learn' it. You can do most anything in the GDAL/OGR world, but it's a very particular world, with its own sets of rules that must be learned, and many of them aren't incredibly well documented. I think for geospatial to grow it needs to work with the ways others work, and let them leverage what they've already learned and just add in the 'geo' part. I think the DuckDB take is ideal - use GDAL under the hood for format support, but provide totally new ways to work with it, in a form that's accessible to a much wider audience (people who work with BI and SQL). And from a developer's perspective, 'import duckdb' and then conn.execute("Install spatial; load spatial;) is way easier than dealing with GDAL installation. And you get a full database you can code against.

bdon Jul 28, 2023
Maintainer

@PostholerCom so what is your opposition to DuckDB as an alternate client to GDAL/OGR for reading Overture Parquet files? If they can be used for equivalent tasks, and DuckDB is easier to package and deploy - that seems like it makes the data product more accessible, not less. DuckDB is being developed within a larger general-purpose ecosystem and Parquet is an open format.

PostholerCom Jul 28, 2023

@cholmes I respect and appreciate what you've done. Full stop. Thank you! @bdon I have zero opposition to duckdb. None.

Over the last 30+ years I've watched software come and go, vendors come and go, support come and go. Node packages become abandoned, python modules become abandoned. It's happening as we speak. If I was a betting man I'd say the same for Big Data and containers. (Yikes! I said that in public).

In the long term, in my misguided opinion, I think developers & admins are better served by keeping it as simple as possible. Columnar data is hot today, maybe not so much tomorrow? In the long term, using core tech (GDAL as an example) will best serve us all.

That's why I support a GDAL only approach.

Maxxen · 2023-07-28T08:14:49Z

Maxxen
Jul 28, 2023

Super cool to see the interest around DuckDB spatial that this has created! I'm well aware that some things are still a little rough around the edges, e.g. the lack of spatial indexes, slow spatial joins and lack of predicate pushdown into structs. While I really wish I could bring these features to you sooner, I'll just do a shameless plug and mention that we've not been able to secure any funding for development on the spatial extension in particular (either through client projects or support contracts) and I therefore have to spend more of my energy on other projects. That said we're really grateful to just receive feedback and bug reports, and I'll make a note of any limitations people end up running into or use cases to support in the future.

0 replies

Sok46 · 2023-09-15T08:15:33Z

Sok46
Sep 15, 2023

I used the example from the instructions for DuckDB and tried to remake it to match the attributes of buildings. But the code doesn't work. What am I doing wrong?

db = duckdb.connect()
db.execute("INSTALL spatial")
db.execute("INSTALL httpfs")
db.execute("""
LOAD spatial;
LOAD httpfs;
SET s3_region='eu-central-1';
""")

db.execute("""
COPY (
    SELECT
           type,
           names,
           height,
           numFloors,
           class,
           JSON(names) AS names,
           JSON(sources) AS sources,
           ST_GeomFromWkb(geometry) AS geometry
      FROM read_parquet('s3://overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=buildings/type=*/*', filename=true, hive_partitioning=1)
     WHERE ST_GeometryType(ST_GeomFromWkb(geometry)) IN ('POLYGON','MULTIPOLYGON')
    limit 100
) TO 'buildings.geojson'
WITH (FORMAT GDAL, DRIVER 'GeoJSON');
""")

Can you show an example of DuckDB code for issuing geojson buildings for a specific city?
I'm new to this, please forgive me if the question seems stupid

1 reply

mtravis Sep 19, 2023
Collaborator

This example which is almost a carbon copy of yours works for me in the DuckDB cli but not in python at first.

Turns out I was on version 0.7 in python so upgraded to 0.8 and your example worked.

BruceHarold · 2024-01-29T15:44:50Z

BruceHarold
Jan 29, 2024

I did a blog post for the Esri community board I run. I really should have read the data into a dataframe and let that handle the field schema, next time maybe, but anyway a good exercise.

2 replies

Maxxen Jan 29, 2024

FYI, querying nested structs in remote parquet files is going to be a lot faster next release, see duckdb/duckdb#10314

jwass Jan 29, 2024
Collaborator

Also - we're going to work on better spatial partitioning on the Overture side so that row group filtering using the bbox will be more effective.

BruceHarold · 2024-04-18T20:28:45Z

BruceHarold
Apr 18, 2024

Here is a Notebook that extracts transportation/segments features from an arbitrary division_area in the divisions theme, which I think will be a common pattern.
I'm seeing 22-25 minutes to retrieve the data, if anyone has tips for more speed let us all know. It's a pretty big theme.

I'll get to unnesting the road column next, which is where a lot of value is.

ExtractSegments.zip

5 replies

jwass Apr 18, 2024
Collaborator

Hi @BruceHarold,
Here's how I'd do this- which runs in about 30s on my laptop

import duckdb

conn = duckdb.connect("./segments.duckdb")
conn.sql("install spatial;load spatial;")
conn.sql("install httpfs;load httpfs;")
conn.sql("set s3_region='us-west-2';")

bbox, wkt = conn.sql("""
select
    bbox,
    ST_AsText(ST_GeomFromWKB(geometry)) AS wkt
from read_parquet('s3://overturemaps-us-west-2/release/2024-04-16-beta.0/theme=divisions/type=division_area/*', filename=true, hive_partitioning=1)
where
    country = 'DE'
    and subtype = 'region'
    and names.primary = 'Berlin';"""
).fetchall()[0]

segments = conn.sql(f"""
create or replace table segments as
select
    *
from read_parquet('s3://overturemaps-us-west-2/release/2024-04-16-beta.0/theme=transportation/type=segment/*', filename=true, hive_partitioning=1)
where
    bbox.xmin <= {bbox["xmax"]}
    AND bbox.xmax >= {bbox["xmin"]}
    AND bbox.ymin <= {bbox["ymax"]}
    AND bbox.ymax >= {bbox["ymin"]} 
    AND ST_Intersects(ST_GeomFromText('{wkt}'), ST_GeomFromWKB(geometry));
"""
)

This doesn't create all the views like you have so it's just pared down to the core parts.

The key part is that the row group filtering in duckdb only works on constant values (see more duckdb/duckdb#10314). So we can use python to just make them explicit literals in the query and it all works out pretty well pulling the data remotely.

Note - one reason I grabbed the geometry as wkt is because I couldn't quite figure out how to put it back in the prompt as binary. Something to figure out.

Let me know if that helps.

BruceHarold Apr 19, 2024

Thanks Jacob, I'll refactor my notebook and try this out and report back here. I went with views to do an initial bbox filter because I got an error, something like "unable to create a file .tmp" when I tried the ST_Intersects in memory.. One thing, queries on division_area can return multiple features which is why I went with min() and max() functions. Thanks for monitoring the thread.

BruceHarold Apr 19, 2024

Here is how I refactored it, thanks for the guidance. I allow for multiple division_area features being the selectors. Performance is fantastic!

ExtractSegments.zip

jwass Apr 23, 2024
Collaborator

@BruceHarold One extra thing to try is set enable_object_cache=true which should cache parquet metadata within a single session (not across new sessions). So when you query ...theme=transportation/type=segment/* more than once, the second call should be much faster. I think you'll need a recent version of duckdb with a bug fix for this to work properly.

BruceHarold Apr 23, 2024

Thank you, I added that call. Today though I'm struggling with unpacking the road property, it is monumentally challenging!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DuckDB processing #23

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 6 comments 18 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

DuckDB processing #23

Replies: 6 comments · 18 replies

cholmes Jul 28, 2023 Author

cholmes Jul 30, 2023 Author

bdon Jul 31, 2023 Maintainer

cholmes Jul 31, 2023 Author

bdon Jul 28, 2023 Maintainer

bdon Jul 28, 2023 Maintainer

cholmes Jul 28, 2023 Author

bdon Jul 28, 2023 Maintainer

mtravis Sep 19, 2023 Collaborator

jwass Jan 29, 2024 Collaborator

jwass Apr 18, 2024 Collaborator

jwass Apr 23, 2024 Collaborator

Replies: 6 comments 18 replies

cholmes
Jul 28, 2023
Author

cholmes Jul 30, 2023
Author

bdon Jul 31, 2023
Maintainer

cholmes Jul 31, 2023
Author

bdon
Jul 28, 2023
Maintainer

bdon Jul 28, 2023
Maintainer

cholmes Jul 28, 2023
Author

bdon Jul 28, 2023
Maintainer

mtravis Sep 19, 2023
Collaborator

jwass Jan 29, 2024
Collaborator

jwass Apr 18, 2024
Collaborator

jwass Apr 23, 2024
Collaborator