Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cool! #23

Open
paleolimbot opened this issue Aug 28, 2023 · 2 comments
Open

Cool! #23

paleolimbot opened this issue Aug 28, 2023 · 2 comments

Comments

@paleolimbot
Copy link

Feel free to close, just sharing some of my experiments. This is very cool! I imagine you don't have much control over the benchmarks since you're mostly just connecting the wires for rust functions, plus, I find that benchmarking GEOS geometries is pretty difficult because putting that many external pointers into a list() makes the garbage collector go a little haywire, which may affect some things here.

Any reason why you didn't prefix the functions? I mostly just typed rsgeo:: to get the autocomplete (but maybe that's what you were aiming for).

It might be nice (and considerably faster) if your rectangle-producing functions produced a wk_rct()? This would get them plot(), st_as_sf(), etc. for free (plus lets you propagate the CRS to whatever consumes the result).

Is the geometry type as part of the class member (e.g., rs_GEOMETRYCOLLECTION) necessary? It seems like it may be more efficient to only resolve the unique geometry types when required rather than loop over the result every time to build up this information?

# https://github.com/geoarrow/geoarrow-data/releases/download/latest-dev/ns-water-water_line-wkb.parquet
lines_tbl_wkb <- arrow::read_parquet(
  "~/Desktop/rscratch/geoarrow-data/ns-water/ns-water-water_line-wkb.parquet"
)

vec_wkb <- wk::wkb(lines_tbl_wkb$geometry)
vec_sf <- sf::st_as_sfc(vec_wkb)
vec_geos <- geos::as_geos_geometry(vec_wkb)
vec_s2 <- s2::s2_geog_from_wkb(vec_wkb)
vec_rsgeo <- rsgeo::as_rsgeo(vec_sf)

bench::mark(
  rsgeo::minimum_rotated_rect(vec_rsgeo),
  geos::geos_minimum_rotated_rectangle(vec_geos),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 rsgeo::minimum_rotated_rect(vec_rsg… 3.08s  3.08s     0.324    3.69MB        0
#> 2 geos::geos_minimum_rotated_rectangl… 5.57s  5.57s     0.180   11.07MB        0

# ...probably because rust is using ellipsoidal and s2 is using spherical (faster)
bench::mark(
  rsgeo::length_geodesic(vec_rsgeo),
  s2::s2_length(vec_s2),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 rsgeo::length_geodesic(vec_rsg…    8.69s    8.69s     0.115    3.69MB        0
#> 2 s2::s2_length(vec_s2)           359.53ms 362.05ms     2.76    25.85MB        0

bench::mark(
  rsgeo::length_euclidean(vec_rsgeo),
  geos::geos_length(vec_geos)
)
#> # A tibble: 2 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 rsgeo::length_euclidean(vec_rsge…   251ms 265.5ms      3.77    3.69MB        0
#> 2 geos::geos_length(vec_geos)        31.7ms  32.2ms     29.9     3.69MB        0

bench::mark(
  rsgeo::bounding_box(vec_rsgeo),
  wk::wk_bbox(vec_wkb),
  wk::wk_bbox(vec_geos),
  check = FALSE
)
#> # A tibble: 3 × 6
#>   expression                          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 rsgeo::bounding_box(vec_rsgeo)    256ms    256ms      3.91     2.1KB        0
#> 2 wk::wk_bbox(vec_wkb)              145ms    145ms      6.88   16.12KB        0
#> 3 wk::wk_bbox(vec_geos)             223ms    223ms      4.48    2.46KB        0

bench::mark(
  rsgeo::centroids(vec_rsgeo),
  s2::s2_centroid(vec_s2),
  geos::geos_centroid(vec_geos),
  check = FALSE,
  iterations = 3
)
#> # A tibble: 3 × 6
#>   expression                         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 rsgeo::centroids(vec_rsgeo)      1.11s    1.11s     0.899    3.69MB     2.70
#> 2 s2::s2_centroid(vec_s2)       365.99ms 388.47ms     2.48    11.07MB     0   
#> 3 geos::geos_centroid(vec_geos) 203.58ms 206.49ms     3.54    11.07MB     0

Created on 2023-08-27 with reprex v2.0.2

@JosiahParry
Copy link
Owner

Thank you for checking this out!! I very much appreciate your feedback :)) Some thoughts as I read along.

  • I think there could be a big improvement if list ALTREP is used (probably here and in geos) that way there could be just 1 pointer instead of n.
  • The prefixing is something that was brought up by @Robinlovelace as well. This is probably because I almost only call function from their namespace and find writing geos::geos or s2::s2 etc repetitive and overly verbose. But I do now see your point!
  • rectangles: I think if I were to produce wk rectangles i'd want to be able to 1) understand wk and 2) be able to convert to and fro with wk for all geometries and 3) do it as intended as opposed to how I've faked the wk_handler() by creating sfc geometries and converting from geo-types -> sfc and using the sfc handler
  • I was inspired to have individual geometry types by sf, I quite like that about it. It's also quite nice because having the class attribute actually allows me to not loop over each geometry to figure out its type. Not every algorithm in geo is implemented for each geometry type. Having the class specify the geometry type lets me catch it before running anything somewhat like stopifnot(). And since I know the output types of the functions (due to rust type system), I can specify the class at the end without looping
  • I added parallelization to a few of these functions to pretend that they're faster now. But you're right, a lot of the computation speed is out of my hands. The bbox calculation I created from scratch so I could probably figure out a better way to do that.
lines_tbl_wkb <- arrow::read_parquet(
  "/Users/josiahparry/Downloads/ns-water-water_line-wkb.parquet"
)

vec_wkb <- wk::wkb(lines_tbl_wkb$geometry)
vec_sf <- sf::st_as_sfc(vec_wkb)
vec_geos <- geos::as_geos_geometry(vec_wkb)
vec_s2 <- s2::s2_geog_from_wkb(vec_wkb)
vec_rsgeo <- rsgeo::as_rsgeo(vec_sf)

bench::mark(
  rsgeo::minimum_rotated_rect(vec_rsgeo),
  geos::geos_minimum_rotated_rectangle(vec_geos),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression                                          min   median `itr/sec`
#>   <bch:expr>                                     <bch:tm> <bch:tm>     <dbl>
#> 1 rsgeo::minimum_rotated_rect(vec_rsgeo)            1.99s    1.99s     0.501
#> 2 geos::geos_minimum_rotated_rectangle(vec_geos)    6.18s    6.18s     0.162
#> # ℹ 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>

bench::mark(
  rsgeo::length_geodesic(vec_rsgeo),
  s2::s2_length(vec_s2),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression                             min   median `itr/sec` mem_alloc
#>   <bch:expr>                        <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 rsgeo::length_geodesic(vec_rsgeo)    1.87s    1.87s     0.535    3.69MB
#> 2 s2::s2_length(vec_s2)             346.59ms 350.58ms     2.85    25.85MB
#> # ℹ 1 more variable: `gc/sec` <dbl>

@paleolimbot
Copy link
Author

I think there could be a big improvement if list ALTREP is used (probably here and in geos) that way there could be just 1 pointer instead of n.

..and s2. I'd like to build some abstractions around ALTREP soon because I've implemented it a few times and it's very difficult to get right and it's totally undocumented. It is a lot of work to do (I'm sure you know this) which is why I haven't gotten to it. It is also mostly not a problem for the sizes of data that are typically encountered in R (although that's a chicken-and-egg problem...people might work with bigger data if it wasn't so awful).

It's also quite nice because having the class attribute actually allows me to not loop over each geometry to figure out its type.

...except for all the operations that don't require a specific geometry (notably, subsetting: vec_rsgeo[stuff]). You're making a bet that the unique geometry type will be accessed more than once, so pre-computing and caching it will pay off. Be very sure that this is the case before baking it in to the class attribute! It would be safer would be to implement geometry_class_name()...if in the future you want to implement the caching you can always update it to inspect the class name first.

I added parallelization to a few of these functions to pretend that they're faster now.

Cool! I've always wanted to do that with geos and s2...I imagine it's easier to code in rust. C++ doesn't have a built-in concept of a "thread pool" and I haven't kept up with the best ways to do this that don't involve vendoring huge swaths of unmaintainable C++.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants