`py_to_r` refactor (c++) #1552

t-kalinowski · 2024-03-19T17:40:47Z

This PR diff is unfortunately larger than I'd like. It started with trying to fix one bug, but with the way the py_to_r() conversion routines were factored, I realized it made sense to implement the fix "upstream", in PyObjectRef, rather than at each cpp call site of py_to_r() and py_ref(). That led me to need to touch code in lots of places (and load it all in my head). As I refactored, I discovered more bugs and, with the code fresh in my mind, I also implemented some straightforward fixes and optimizations. The size of the changes snowballed, and at this point, dividing them into smaller PRs doesn't seem worth the effort.

User facing changes:

Callable python objects created with convert = FALSE now always get wrapped in a function.
(i.e., typeof(py_eval("lambda x: x", convert = FALSE)) == "function"). This now works:

fn <- py_eval("lambda x: x + 1", convert = FALSE)
fn(2) # error in current version, required using py_call()
      # now returns a py ref to '3.0' with this PR
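
One way to picture this behaviour is as a closure built around an opaque reference. The Python sketch below is only an analogy under that assumption: `Ref` and `wrap_callable` are hypothetical names made up for illustration, and this is not how PyObjectRef is actually implemented.

```python
class Ref:
    """Hypothetical stand-in for an unconverted reference to a Python object."""

    def __init__(self, obj):
        self.obj = obj


def wrap_callable(ref):
    """Return a plain function that forwards calls to the referenced callable.

    Mirrors convert = FALSE: the result is handed back as another Ref
    rather than being converted to a native value.
    """
    def wrapper(*args, **kwargs):
        return Ref(ref.obj(*args, **kwargs))

    wrapper.ref = ref  # keep the underlying reference reachable
    return wrapper


fn = wrap_callable(Ref(lambda x: x + 1))
result = fn(2.0)  # callable directly; no separate "call" helper is needed
```

The wrapper keeps the reference attached to the function, so code that needs the underlying object can still reach it.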

py_to_r() S3 methods are now called on Python objects supplied as arguments to R functions that are called from Python, if the R function was converted with convert = TRUE. This used to error, and now works:

myfoo <- reticulate::py_eval("type('MyFoo', (), {})")()
py_docall <- reticulate::py_eval("lambda fn, *args: fn(*args)")
py_to_r.__main__.MyFoo <- function(x) 42
registerS3method("py_to_r", "__main__.MyFoo", 
                 py_to_r.__main__.MyFoo, 
                 asNamespace("reticulate"))
py_docall(function(x) {
  if (x != 42) stop("custom py_to_r method not found")
}, myfoo)
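
The dispatch idea in the example above can be sketched in Python, with a dictionary keyed by "module.ClassName" standing in for registered S3 methods. Everything here (`class_names`, `converters`, `convert_arg`, `with_converted_args`) is a hypothetical name for illustration; the real dispatch happens through R's S3 system and is not shown in this PR description.

```python
def class_names(obj):
    """Class names in method-resolution order, most specific first."""
    return [f"{cls.__module__}.{cls.__qualname__}" for cls in type(obj).__mro__]


# Hypothetical registry standing in for registered py_to_r() S3 methods.
converters = {}


def convert_arg(obj):
    """Use the first registered converter along the MRO, else return obj as is."""
    for name in class_names(obj):
        if name in converters:
            return converters[name](obj)
    return obj


def with_converted_args(fn):
    """Wrap fn so every positional argument is converted before the call."""
    def wrapper(*args):
        return fn(*(convert_arg(a) for a in args))
    return wrapper


# The module name is set explicitly so the registry key is predictable.
MyFoo = type("MyFoo", (), {"__module__": "__main__"})
converters["__main__.MyFoo"] = lambda x: 42

callback = with_converted_args(lambda x: x)
```

Calling `callback(MyFoo())` hands the wrapped function 42 instead of the raw object, while an argument with no registered converter passes through unchanged.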

py_to_r(x) no longer signals an error if x is not a python object. In that case, x is returned unmodified.
(This is to avoid introducing new errors in user code, if users wrote previous workarounds for S3 generics that should have been getting called, but weren't.)
attr(x, "tzone") attributes are (better) preserved when converting POSIXt to Python.
POSIXt types with a non-empty tzone attr will always convert to a datetime.datetime,
otherwise they will convert to numpy datetime64[ns] arrays.
Fixed an issue where calling py_set_item() on a subclassed dict would not invoke a custom __setitem__ method.
py_del_attr(x, name) now returns x invisibly.
source_python() no longer exports (assigns) the r symbol
(the "R Interface object" that Python code uses to get a reference to the R globalenv).
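
The py_set_item() fix in the list above concerns dict subclasses with a custom __setitem__. The kind of difference involved can be reproduced in pure Python. This sketch assumes the problem was a dict-specific code path that bypassed the subclass override; the description does not spell out the mechanism, and `LoudDict` is a made-up class.

```python
class LoudDict(dict):
    """Dict subclass that records every key assigned through __setitem__."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def __setitem__(self, key, value):
        self.seen.append(key)
        super().__setitem__(key, value)


d = LoudDict()
dict.__setitem__(d, "a", 1)  # base-class method: stores the item, skips the override
d["b"] = 2                   # generic item assignment: runs LoudDict.__setitem__
```

Both items end up in the dict, but only "b" is recorded in `d.seen`, which is the sort of silent bypass a subclass author would not expect.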

Developer facing changes:

in c++ functions, r_to_py(RObject, convert) is now all you need to go from SEXP to PyObject*
in c++ functions, py_to_r(PyObject*, convert) is now all you need to go from PyObject* to SEXP
py_to_r(PyObject*, convert = false) will now always return a PyObjectRef.
py_to_r(PyObject*, convert = true) will now invoke py_to_r() S3 methods, if the object is not a "simple" python object.
r_to_py(RObject, convert) now returns the underlying PyObject* if it is a PyObjectRef already.
The SEXP underlying PyObjectRef can now be either an R environment, or an R closure (if PyObject* is callable). All the common logic for building up a PyObjectRef for presentation in R is now done within the PyObjectRef constructor. This includes building up the closure if the PyObject is callable, and building up the S3 class attribute.
py_to_r_wrapper() S3 generic will be called on all callable PyObjectRefs
(both with convert = TRUE and convert = FALSE)
in R code there is now generally no need to call
if(py_is_module_proxy(x)) py_resolve_module_proxy(x)
or
ensure_python_initialized()
This is handled upstream in PyObjectRef or the individual c++ methods

Optimizations:

logical R arrays get converted to numpy bool arrays that are strided views of the R array (avoiding a full copy)
removed (now) unnecessary S3 methods/code that are now handled in c++
misc small speedups (e.g., repeated calls in for loops to INTEGER() and similar have been moved out)
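
To illustrate what a strided view buys over a copy, here is a small standard-library sketch. It assumes that each R logical occupies one C int and that a boolean view can be formed by reading one byte per element; the actual numpy array construction in the PR is not shown here, and how NA values are handled is outside the scope of this sketch.

```python
import sys
from array import array

# R stores each logical as a C int; array("i") is a stand-in for that buffer.
logicals = array("i", [1, 0, 1, 1])
width = logicals.itemsize

# Offset of the least significant byte within each int on this machine.
low_byte = 0 if sys.byteorder == "little" else width - 1

# View one byte per element, stepping a whole int at a time. No data is copied.
flags = memoryview(logicals).cast("B")[low_byte::width]
before = flags.tolist()

logicals[1] = 1         # write through the original buffer
after = flags.tolist()  # the view sees the change
```

Because `flags` shares memory with `logicals`, the update is visible through the view without any conversion pass.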

In some ad-hoc benchmarks simulating some series of operations with python objects, I see ~2x speedup. E.g., this is ~2x faster:

{
 x <- np_array(array(runif(1000)))
 y <- np_array(array(runif(1000)))
 z <- x + y
 z <- z[x > .5 & y < .5]
 py_to_r(z)
}

Though, in some minimal benchmarks of a callable created with convert = FALSE, there is now a small slowdown due to the extra work of building up an R closure (including S3 dispatch of py_to_r_wrapper()), e.g., for py_eval("lambda x: x", convert = FALSE).

kevinushey

Overall LGTM, but it's a large PR so it's challenging to review in depth. Are there more tests we could add to validate that we haven't changed any existing behavior unexpectedly?

src/python.cpp

kevinushey · 2024-03-19T19:11:08Z

@@ -1084,7 +1319,12 @@ SEXP py_to_r(PyObject* x, bool convert) {
  }

  // dict
-  else if (PyDict_CheckExact(x)) {
+  if (PyDict_CheckExact(x)) {
+    // if you tempted to change this to PyDict_Check() to allow subclasses, don't.


src/python.cpp

kevinushey · 2024-03-19T19:17:05Z

Callable python objects created with convert = FALSE now always get wrapped in a function.
(i.e., typeof(py_eval("lambda x: x", convert = FALSE)) == "function").

Can you elaborate on the motivation? What if a user wants to create a Python function from R, that is later passed back into Python? A dummy example:

py_run_string("def apply(f, x): return f(x)")
fn <- py_eval("lambda x: x + 1")
main <- import_main()
main$apply(fn, 2)

t-kalinowski · 2024-03-19T19:40:36Z

The example works just fine - and should be unchanged from before:

library(reticulate)

py_run_string("def apply(f, x): return f(x)")
fn <- py_eval("lambda x: x + 1")
main <- import_main()
main$apply(fn, 2)
#> [1] 3

What is different is what happens when convert = FALSE:

fn <- py_eval("lambda x: x + 1", convert = FALSE)
fn(2) # error in current version, required using py_call()
      # now returns a py ref to '3.0' with this PR

t-kalinowski · 2024-03-19T19:45:51Z

The specific bug that I was initially trying to fix was in tfdatasets:

library(tfdatasets)
make_csv_dataset("hearts.csv") |>
  dataset_map(\(x) { ... })

I was expecting x to come in as a named R list, but instead it was coming in as an unconverted OrderedDict. Pulling at the thread, I noticed that we already define the S3 method, but that S3 methods are not called by the C++ version of py_to_r().

t-kalinowski · 2024-03-19T19:47:05Z

Closes #1024

t-kalinowski · 2024-03-19T19:53:53Z

Are there more tests we could add to validate that we haven't changed any existing behavior unexpectedly?

Do you have any specific edge-cases in mind? I already expanded some tests related to conversion, and will still add a few more related to logical numpy array conversion. Anything else?

t-kalinowski · 2024-03-19T20:00:38Z

A quick benchmark: this snippet simulates a "real" workflow and compares the current main branch with this PR branch (~3x faster now):

library(reticulate)
bench::mark(np_array_work = {
  x <- np_array(array(runif(1000)))
  y <- np_array(array(runif(1000)))
  z <- x + y
  w <- py_to_r(x > .5 & y < .5)
  z <- z[w]
  py_to_r(z)
})

#> # A tibble: 2 × 12
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#> 1 main_now      1.4ms   1.49ms      652.     435KB     8.25  9875   125
#> 2 refactored  457.8µs 485.67µs     1961.     308KB    13.2   9933    67
#> # ℹ 4 more variables: total_time <bch:tm>, memory <list>, time <list>,
#> #   gc <list>

kevinushey · 2024-03-19T21:42:22Z

Do you have any specific edge-cases in mind? I already expanded some tests related to conversion, and will still add a few more related to logical numpy array conversion. Anything else?

Nothing specifically, just general trepidation resulting from seeing a large PR :-)

Do we already have test coverage for confirming that the removal of these S3 methods doesn't affect anything here? https://github.com/rstudio/reticulate/pull/1552/files#diff-b6b6854a02ae177f8860f49654ff250c1554d021ae434b6efdbe99d00bdc6055L7

t-kalinowski added 30 commits March 14, 2024 11:24

import C func PyIter_Check()

1061801

simplify iter_next()

264dfd3

move as_iterator to c++

29c91b6

simplify+move py_iterate(), py_iter_next()

6685487

fix (silently failing) pandas datetime test

41aacc0

use test_path() in test source_python()

c5fa914

fix silently failing py_dict test

657bfb1

delete special external ptr capsule codepath (covered now by general path)

f1aec61

conditionally use PyIter_Check (available starting in Python 3.10)

335127c

initialze common SEXPs in pkg init.

e4cda38

handle POSIXt conversion to numpy in c++

14d22fb

updates for new PyObjectRef

d885c5e

move PyErrorScopeGuard

d4c9f4b

py_dict_set_item - find custom __setitem__ methods

da1c85f

avoid repeated eval of constants

058fac9

convert logical arrays to numpy arrays with a strided view, avoid full copy

beb3ce6

refactor py_to_r (c++ ver)

11ba1de

update py_to_r() R functions

c645db1

update r_to_py and py_to_r methods for datetimes/ POSIXt

7b0f8e6

pandas conversion fixes

c3fe9a5

source_python: don't export 'r'

1881a7b

simplify py_maybe_convert()

5b0b30d

remove module_proxy checks; simplify r getitem and getattr

093f915

In R code there is now generally no need to call if(py_is_module_proxy(x)) py_resolve_module_proxy(x). This is handled upstream in PyObjectRef().

r utils

a5951f8

handle pandas na conversion in S3 methods

9894670

call S3 generics from c++ in userenv

225fb13

checkin updated PyObjectRef

caf60b5

redocument

ed1e70b

move py_del_item to c++

715eaa5

consolidate py_initialize and py_flush guardrails in cpp

64a1729

t-kalinowski added 6 commits March 19, 2024 11:03

expand conversion tests

86a8c57

fix tests

cab0226

expand CI

82048e9

more tests

11fc42d

proofread + fixes

ad6b8aa

add NEWS

ce917fd

t-kalinowski requested a review from kevinushey March 19, 2024 17:41

t-kalinowski changed the title ~~Py to r refactor~~ py_to_r refactor (c++) Mar 19, 2024

kevinushey reviewed Mar 19, 2024

t-kalinowski added 2 commits March 19, 2024 17:59

address review feedback

a713a32

add tests

1906d01

t-kalinowski merged commit 7fba309 into main Mar 20, 2024
13 checks passed

t-kalinowski deleted the py_to_r-refactor branch March 20, 2024 13:48

This was referenced Mar 20, 2024

Minor fixes #1553

Merged

Resolve Module Proxies #1561

Merged
