pandas DataFrame with np.nan are not converted correctly #207
Comments
@kevinushey Could you take a look at this one? |
Perhaps rather than converting That said, I would highly recommend using NumPy |
Now fixed up on master, but note that deserializing this object gives you list-columns, e.g.
We could consider looping and simplifying character vectors generated in this way, though. @jjallaire what do you think? |
Note: a related conversion failure occurs where missing values in Pandas DataFrames are coded as |
Related issue but for NAs in general: #197 |
Has there been any update on this? It would be super useful if |
Was this issue resolved with version 1.24? I see the milestone for this, but the 1.24 release notes don't mention anything |
I am using version 1.25 and the issue seems to be open. I agree with the sentiments shared by others. Pandas converts
# Python Chunk
import pandas as pd, numpy as np
pp_ok = pd.DataFrame({'x': (None, 1), 'y': ('a', 'bb')})
ss_none = pd.DataFrame({'x': (None, 1), 'y': ('a', None)})
rr_nan = pd.DataFrame({'x': (None, 1), 'y': ('a', np.nan)})
# 'object' type columns keep None as None (Numeric types treat it as np.nan)
print(ss_none)
## x y
## 0 NaN a
## 1 1.0 None
print(rr_nan)
## x y
## 0 NaN a
## 1 1.0 NaN # NULL should be avoided in DataFrame
str(py$pp_ok, give.attr = FALSE) #np.nan (of numeric) to NaN in vector
## 'data.frame': 2 obs. of 2 variables:
## $ x: num NaN 1
## $ y: chr "a" "bb"
str(py$ss_none, give.attr = FALSE) #None (of object) to NULL in list
## 'data.frame': 2 obs. of 2 variables:
## $ x: num NaN 1
## $ y:List of 2
## ..$ : chr "a"
## ..$ : NULL
str(py$rr_nan, give.attr = FALSE) #np.nan (of object) to NaN in list
## 'data.frame': 2 obs. of 2 variables:
## $ x: num NaN 1
## $ y:List of 2
## ..$ : chr "a"
## ..$ : num NaN
sessionInfo() & py_config()
## R SessionInfo
sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_India.utf8 LC_CTYPE=English_India.utf8
## [3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_India.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] reticulate_1.25
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 lattice_0.20-45 png_0.1-7 digest_0.6.29
## [5] grid_4.2.1 R6_2.5.1 jsonlite_1.8.0 magrittr_2.0.3
## [9] evaluate_0.15 stringi_1.7.8 rlang_1.0.3 cli_3.3.0
## [13] rstudioapi_0.13 jquerylib_0.1.4 Matrix_1.4-1 bslib_0.3.1
## [17] rmarkdown_2.14 tools_4.2.1 stringr_1.4.0 xfun_0.31
## [21] yaml_2.3.5 fastmap_1.1.0 compiler_4.2.1 htmltools_0.5.2
## [25] knitr_1.39 sass_0.4.1
## Python Configuration
py_config()
## python: C:/Softwares/Python/Python310/python.exe
## libpython: C:/Softwares/Python/Python310/python310.dll
## pythonhome: C:/Softwares/Python/Python310
## version: 3.10.5 (tags/v3.10.5:f377153, Jun 6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)]
## Architecture: 64bit
## numpy: C:/Softwares/Python/Python310/Lib/site-packages/numpy
## numpy_version: 1.23.1
##
## NOTE: Python version was forced by RETICULATE_PYTHON |
We're having similar issues with Pandas dataframe to R conversions producing
Ideally, in our use case, missing-value representations like NaN or None in Pandas object columns would automatically convert to NA. |
My current workaround is to The intent of |
I have experienced these same issues in my own work. I made an example to illustrate the behavior of converting pandas dataframes to R. I hope it's useful! :) Summary of my observations:
library(reticulate)
library(tibble)
py_run_string(
paste(
"import pandas as pd",
"import numpy as np",
"df = pd.DataFrame()",
"df['numeric'] = pd.Series([1, 2, 3])",
"df['numeric_na'] = pd.Series([1, 2, np.nan])",
"df['character_pd_none'] = pd.Series(['a', 'b', None], dtype = 'str')",
"df['character_np_none'] = np.array(['a', 'b', None], dtype = 'str')",
"df['character_pd_nan'] = pd.Series(['a', 'b', np.nan], dtype = 'str')",
"df['character_np_nan'] = np.array(['a', 'b', np.nan], dtype = 'str')",
sep = "\n"
)
)
glimpse(py$df)
#> Rows: 3
#> Columns: 6
#> $ numeric <dbl> 1, 2, 3
#> $ numeric_na <dbl> 1, 2, NaN
#> $ character_pd_none <list> "a", "b", <NULL>
#> $ character_np_none <chr> "a", "b", "None"
#> $ character_pd_nan <list> "a", "b", NaN
#> $ character_np_nan <chr> "a", "b", "nan"
Created on 2023-02-02 with reprex v2.0.2 |
Any thoughts on this? Reticulate is an amazing utility, but I experience this issue every time I use it: Character vectors in pandas data frames with missing values often become R lists with NULLs and overall, their conversion is unreliable.
As a user, the challenge I experience is this: I use reticulate to run Python code that returns a pandas data frame. I put a reticulate wrapper over top. Reticulate returns an R data frame with a list column and I don't know what to do with it. Does this list column have a nested structure that should be respected? Or is it just a wonky conversion of a pandas character vector?
My vote is that pandas character vectors of all flavors (pd.Series, np.arrays, containing Nones, containing nans, etc.) should be converted to R character vectors with NAs. Never a list, and never a vector with NaN.
Greatly appreciate all the hard work on this amazing package. I don't know what I would do without it! |
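The rule proposed in the comment above (string-like columns always become character vectors, with every missing marker mapped to a single value) can be sketched in plain Python. This is only an illustration of the idea, not reticulate's conversion code: the helper names `is_missing` and `as_character_column` are hypothetical, plain lists stand in for pandas columns, and `None` stands in for R's `NA`.

```python
import math


def is_missing(value):
    """Return True for None and for float NaN (np.nan is a Python float)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def as_character_column(values):
    """Map every missing marker to None and keep strings unchanged.

    Raises TypeError when a non-missing value is not a string, since such a
    column is not string-like and may really need a list representation.
    """
    result = []
    for value in values:
        if is_missing(value):
            result.append(None)  # None plays the role of NA in this sketch
        elif isinstance(value, str):
            result.append(value)
        else:
            raise TypeError(f"non-string value in column: {value!r}")
    return result


# Columns shaped like the string columns in the reprex above
print(as_character_column(["a", "b", None]))
print(as_character_column(["a", "b", float("nan")]))
```

Both calls produce the same list, which is the point of the proposal: the flavor of the missing marker should not change the type of the converted column.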
Hi @ArthurAndrews, We have made some improvements with an option described in #1439
Dear all,
I've stumbled upon an issue with the way NA values are encoded in Python: it seems that the classical way is to use np.nan whatever the data type, as described in https://pandas.pydata.org/pandas-docs/stable/missing_data.html. This is the convention used by pd.read_csv, for instance. When the data type is a "string", this leads to an issue: reticulate fails to convert such a DataFrame, as shown in the following example
It seems that the column is converted as a list and not to a vector with missing data which seems to be the expected behavior from a Python user perspective.
Yours,
Erwan
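Why a string column with `np.nan` is awkward to convert can be seen without pandas: `np.nan` is an ordinary Python float, so once it sits next to strings the column holds two different types. A minimal sketch, assuming a plain list as a stand-in for the column and `float("nan")` as a stand-in for `np.nan`; `column_types` is a hypothetical helper, not part of pandas or reticulate.

```python
def column_types(values):
    """Return the sorted names of the Python types present in a column."""
    return sorted({type(value).__name__ for value in values})


string_column = ["a", "b", float("nan")]   # float("nan") stands in for np.nan
numeric_column = [1.0, 2.0, float("nan")]

print(column_types(string_column))   # ['float', 'str']
print(column_types(numeric_column))  # ['float']
```

The numeric column stays homogeneous, so it maps naturally to a numeric vector. The string column does not, which is consistent with the list column described in this report.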