Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

py_to_r refactor (c++) #1552

Merged
merged 38 commits into from
Mar 20, 2024
Merged

py_to_r refactor (c++) #1552

merged 38 commits into from
Mar 20, 2024

Conversation

t-kalinowski
Copy link
Member

@t-kalinowski t-kalinowski commented Mar 19, 2024

This PR diff is unfortunately larger than I'd like. It started with trying to fix one bug, but with the way the py_to_r() conversion routines were factored, I realized it made sense to implement the fix "upstream", in PyObjectRef, rather than at each cpp call site of py_to_r() and py_ref(). That led me to need to touch code in lots of places (and load it all in my head). As I refactored, I discovered more bugs and, with the code fresh in my mind, I also implemented some straightforward fixes and optimizations. The size of the changes snowballed, and at this point, dividing them into smaller PRs doesn't seem worth the effort.

User facing changes:

  • Callable python objects created with convert = FALSE now always get wrapped in a function.
    (i.e., typeof(py_eval("lambda x: x", convert = FALSE)) == "function")). This now works:

    fn <- py_eval("lambda x: x + 1", convert = FALSE)
    fn(2) # error in current version, required using py_call()
          # now returns a py ref to '3.0' with this PR
  • py_to_r() S3 methods now are called on python objects supplied as args to R functions being called from Python if the R function was converted with convert = TRUE. This used to error, and now works:

    myfoo <- reticulate::py_eval("type('MyFoo', (), {})")()
    py_docall <- reticulate::py_eval("lambda fn, *args: fn(*args)")
    py_to_r.__main__.MyFoo <- function(x) 42
    registerS3method("py_to_r", "__main__.MyFoo", 
                     py_to_r.__main__.MyFoo, 
                     asNamespace("reticulate"))
    py_docall(function(x) {
      if (x != 42) stop("custom py_to_r method not found")
    }, myfoo)    
  • py_to_r(x) no longer signals an error if x is not a python object. In that case, x is returned unmodified.
    (This is to avoid introducing new errors in user code, if users wrote previous workarounds for S3 generics that should have been getting called, but weren't.)

  • attr(x, "tzone") attributes are (better) preserved when converting POSIXt to Python.
    POSIXt types with a non-empty tzone attr will always convert to a datetime.datetime,
    otherwise they will convert to numpy datetime64[ns] arrays.

  • Fixed an issue where calling py_set_item() on a subclassed dict would not invoke a custom __setitem__ method.

  • py_del_attr(x, name) now returns x invisibly

  • source_python() no longer exports assigns the r symbol
    (the "R Interface object" that is used by python code get a reference to the R globalenv)

Developer facing changes:

  • in c++ functions, r_to_py(RObject, convert) is now all you need to go from SEXP to PyObject*

  • in c++ functions, py_to_r(PyObject*, convert) is now all you need to go from PyObject* to SEXP

  • py_to_r(PyObject*, convert = false) will now always return a PyObjectRef.

  • py_to_r(PyObject*, convert = true) will now invoke py_to_r() S3 methods, if the object is not a "simple" python object.

  • r_to_py(RObject, convert) now returns the underlying PyObject* if it is a PyObjectRef already.

  • The SEXP underlying PyObjectRef can now be either an R environment, or an R closure (if PyObject* is callable). All the common logic for building up a PyObjectRef for presentation in R is now done within the PyObjectRef constructor. This includes building up the closure if the PyObject is callable, and building up the S3 class attribute.

  • py_to_r_wrapper() S3 generic will be called on all callable PyObjectRefs
    (both with convert = TRUE and convert = FALSE)

  • in R code there is now generally no need to call
    if(py_is_module_proxy(x)) py_resolve_module_proxy(x)
    or
    ensure_python_initialized()
    This is handled upstream in PyObjectRef or the individual c++ methods

Optimizations:

  • logical R arrays get converted to numpy bool arrays that are strided views of the R array (avoiding a full copy)
  • removed (now) unnecessary S3 methods/code that are now handled in c++
  • misc small speedups (e.g., repeated calls in for loops to INTEGER() and similar have been moved out)

In some ad-hoc benchmarks simulating some series of operations with python objects, I see ~2x speedup. E.g., this is ~2x faster:

{
 x <- np_array(array(runif(1000)))
 y <- np_array(array(runif(1000)))
 z <- x + y
 z <- z[x > .5 & y < .5]
 py_to_r(z)
}

Though, in some minimal benchmarks of a callable created with convert = FALSE, there is now a small slowdown due the extra work of building up an R closure (including S3 dispatch of py_to_r_wrapper()). I.e. py_eval("lambda x: x", convert = FALSE)

n R code there is now generally no need to call
  if(py_is_module_proxy(x)) py_resolve_module_proxy(x)
This is handled upstream in PyObjectRef()
@t-kalinowski t-kalinowski requested a review from kevinushey March 19, 2024 17:41
@t-kalinowski t-kalinowski changed the title Py to r refactor py_to_r refactor (c++) Mar 19, 2024
Copy link
Collaborator

@kevinushey kevinushey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, but it's a large PR so it's challenging to review in depth. Are there more tests we could add to validate that we haven't changed any existing behavior unexpectedly?

src/python.cpp Outdated Show resolved Hide resolved
src/python.cpp Outdated Show resolved Hide resolved
src/python.cpp Outdated Show resolved Hide resolved
src/python.cpp Outdated Show resolved Hide resolved
src/python.cpp Outdated Show resolved Hide resolved
src/python.cpp Outdated
@@ -1084,7 +1319,12 @@ SEXP py_to_r(PyObject* x, bool convert) {
}

// dict
else if (PyDict_CheckExact(x)) {
if (PyDict_CheckExact(x)) {
// if you tempted to change this to PyDict_Check() to allow subclasses, don't.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😆

src/python.cpp Outdated Show resolved Hide resolved
src/python.cpp Outdated Show resolved Hide resolved
src/python.cpp Outdated Show resolved Hide resolved
src/python.cpp Outdated Show resolved Hide resolved
@kevinushey
Copy link
Collaborator

Callable python objects created with convert = FALSE now always get wrapped in a function.
(i.e., typeof(py_eval("lambda x: x", convert = FALSE)) == "function")).

Can you elaborate on the motivation? What if a user wants to create a Python function from R, that is later passed back into Python? A dummy example:

py_run_string("def apply(f, x): return f(x)")
fn <- py_eval("lambda x: x + 1")
main <- import_main()
main$apply(fn, 2)

@t-kalinowski
Copy link
Member Author

t-kalinowski commented Mar 19, 2024

The example works just fine - and should be unchanged from before:

library(reticulate)

py_run_string("def apply(f, x): return f(x)")
fn <- py_eval("lambda x: x + 1")
main <- import_main()
main$apply(fn, 2)
#> [1] 3

What is different is what happens when convert = FALSE

fn <- py_eval("lambda x: x + 1", convert = FALSE)
fn(2) # error in current version, required using py_call()
      # now returns a py ref to '3.0' with this PR

@t-kalinowski
Copy link
Member Author

The specific bug that I was initially trying to fix was in tfdatasets:

library(tfdatasets)
make_csv_dataset("hearts.csv") |>
  dataset_map(\(x) { ... })

Where I was expecting x to come in as a named R list, but instead was coming in as an unconverted OrderedDict. Pulling at the thread, I noticed that we already define the S3 method, but that S3 methods are not called by the C++ version of py_to_r().

@t-kalinowski
Copy link
Member Author

Closes #1024

@t-kalinowski
Copy link
Member Author

Are there more tests we could add to validate that we haven't changed any existing behavior unexpectedly?

Do you have any specific edge-cases in mind? I already expanded some tests related to conversion, and will still add a few more related to logical numpy array conversion. Anything else?

@t-kalinowski
Copy link
Member Author

t-kalinowski commented Mar 19, 2024

A quick benchmark:
This snippet simulating a "real" workflow, comparing the current main branch and this PR branch (~3x faster now):

library(reticulate)
bench::mark(np_array_work = {
  x <- np_array(array(runif(1000)))
  y <- np_array(array(runif(1000)))
  z <- x + y
  w <- py_to_r(x > .5 & y < .5)
  z <- z[w]
  py_to_r(z)
})
#> # A tibble: 2 × 12
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#> 1 main_now      1.4ms   1.49ms      652.     435KB     8.25  9875   125
#> 2 refactored  457.8µs 485.67µs     1961.     308KB    13.2   9933    67
#> # ℹ 4 more variables: total_time <bch:tm>, memory <list>, time <list>,
#> #   gc <list>

@kevinushey
Copy link
Collaborator

Do you have any specific edge-cases in mind? I already expanded some tests related to conversion, and will still add a few more related to logical numpy array conversion. Anything else?

Nothing specifically, just general trepidation resulting from seeing a large PR :-)

Do we already have test coverage for confirming that the removal of these S3 methods doesn't affect anything here? https://github.com/rstudio/reticulate/pull/1552/files#diff-b6b6854a02ae177f8860f49654ff250c1554d021ae434b6efdbe99d00bdc6055L7

@t-kalinowski t-kalinowski merged commit 7fba309 into main Mar 20, 2024
13 checks passed
@t-kalinowski t-kalinowski deleted the py_to_r-refactor branch March 20, 2024 13:48
This was referenced Mar 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants