pandas DataFrame with np.nan are not converted correctly #207

Open
lepennec opened this issue Mar 31, 2018 · 13 comments

@lepennec

lepennec commented Mar 31, 2018

Dear all,
I've stumbled on an issue with the way NA values are encoded in Python: the classical way seems to be to use np.nan whatever the data type, as described in https://pandas.pydata.org/pandas-docs/stable/missing_data.html. This is the convention used by pd.read_csv, for instance. When the data type is a "string", this leads to a problem: reticulate fails to convert such a DataFrame correctly, as shown in the following example:

library(reticulate)

py_run_string('import pandas as pd')
py_run_string('import numpy as np')
py_run_string('date = ["2017/01/12", np.nan]')
py_run_string('data = {"date" : date}')
py_run_string('df = pd.DataFrame(data)')
py_run_string('dfcol = df["date"]')

py$df
py$dfcol

It seems that the column is converted to a list rather than to a vector with missing data, which is what a Python user would expect.
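To see why the column cannot collapse into a single character vector, it may help to look at what the `date` list in the example actually holds. The following is only a sketch using the standard library: `float("nan")` stands in for `np.nan` (which is an ordinary Python float), and the helper name `is_missing` is hypothetical.

```python
import math

# Same values as the `date` list in the example; float("nan") stands in for np.nan.
date = ["2017/01/12", float("nan")]

# The list mixes a str with a float, so there is no single element type.
element_types = [type(value).__name__ for value in date]
print(element_types)


def is_missing(value):
    """Return True for None or a float NaN (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# Flag which entries are missing-value markers rather than real strings.
missing_mask = [is_missing(value) for value in date]
print(missing_mask)
```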

Yours,

Erwan

@jjallaire
Member

@kevinushey Could you take a look at this one?

@kevinushey self-assigned this Apr 2, 2018
@kevinushey
Collaborator

kevinushey commented Apr 2, 2018

Perhaps rather than converting np.nan as NaN (which only exists in R for numeric types) we should use NA (which R defines for all vector types)? Or perhaps in this specific case we need to use NA instead of NaN.

That said, I would highly recommend using NumPy datetime64 objects rather than Python datetime objects within Pandas DataFrames.

@kevinushey
Collaborator

kevinushey commented Apr 2, 2018

Now fixed up on master, but note that deserializing this object gives you list-columns, e.g.

> py$df
        date
1 2017/01/12
2        NaN
> str(py$df)
'data.frame':	2 obs. of  1 variable:
 $ date:List of 2
  ..$ : chr "2017/01/12"
  ..$ : num NaN

We could consider looping and simplifying character vectors generated in this way, though. @jjallaire what do you think?

@andyfarmerboy

andyfarmerboy commented Apr 7, 2018

Note: a related conversion failure occurs when missing values in pandas DataFrames are coded as None. Ideally these would be converted to NA in R, at least in character columns. The only workaround I've found so far is to use the .fillna method to insert an arbitrary character string as a placeholder.
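The placeholder workaround mentioned above could look roughly like this on the Python side. This is a sketch, not a recommended fix: the column name `y` and the placeholder string `"<missing>"` are arbitrary choices for illustration, and the placeholder would still have to be turned back into NA in R afterwards.

```python
import pandas as pd

# Object column with a missing value coded as None.
df = pd.DataFrame({"y": ["a", None]})

# Replace missing entries with an arbitrary placeholder string so the
# column contains only strings before it is handed to reticulate.
PLACEHOLDER = "<missing>"  # hypothetical marker; pick one that cannot occur in the data
df["y"] = df["y"].fillna(PLACEHOLDER)

print(df["y"].tolist())
```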

@terrytangyuan
Contributor

Related issue but for NAs in general: #197

@alexenge

Has there been any update on this? It would be super useful if NaNs inside a string column of a pandas DataFrame were automatically converted to R's NA, so that R could treat the column as a character vector instead of a list.

@kevinushey added this to the 1.24 milestone Jan 17, 2022
@sushantd195

Was this issue resolved in version 1.24? I see the milestone for it, but the 1.24 release notes don't mention anything.

@shivam7898

I am using version 1.25 and the issue still seems to be open. I agree with the sentiments shared by others. Pandas converts None in numeric columns to np.nan (A), while in object columns it keeps None (B) and np.nan (C) as they are. During conversion to an R data.frame, these should be treated as NA. My personal wishlist would be:

  • All 3 (A, B, C) to be converted to NA
  • Else (B & C) None & np.nan in 'object' type to be converted to NA
  • Else (C) None in 'object' type should be converted to NA
  • At least column conversion to a list containing NULL or NaN should be avoided
# Python Chunk
import pandas as pd, numpy as np
pp_ok = pd.DataFrame({'x': (None, 1), 'y': ('a', 'bb')}) 
ss_none = pd.DataFrame({'x': (None, 1), 'y': ('a', None)}) 
rr_nan = pd.DataFrame({'x': (None, 1), 'y': ('a', np.nan)}) 
# 'object' type columns keep None as None (Numeric types treat it as np.nan)
print(ss_none)
##      x     y
## 0  NaN     a
## 1  1.0  None
print(rr_nan)
##      x    y
## 0  NaN    a
## 1  1.0  NaN
# R Chunk
# NULL should be avoided in DataFrame
str(py$pp_ok, give.attr = FALSE)        #np.nan (of numeric) to NaN in vector
## 'data.frame':    2 obs. of  2 variables:
##  $ x: num  NaN 1
##  $ y: chr  "a" "bb"
str(py$ss_none, give.attr = FALSE)      #None (of object) to NULL in list
## 'data.frame':    2 obs. of  2 variables:
##  $ x: num  NaN 1
##  $ y:List of 2
##   ..$ : chr "a"
##   ..$ : NULL
str(py$rr_nan, give.attr = FALSE)       #np.nan (of object) to NaN in list
## 'data.frame':    2 obs. of  2 variables:
##  $ x: num  NaN 1
##  $ y:List of 2
##   ..$ : chr "a"
##   ..$ : num NaN
sessionInfo() & py_config() output:
## R SessionInfo
sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_India.utf8  LC_CTYPE=English_India.utf8   
## [3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C                  
## [5] LC_TIME=English_India.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] reticulate_1.25
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.9      lattice_0.20-45 png_0.1-7       digest_0.6.29  
##  [5] grid_4.2.1      R6_2.5.1        jsonlite_1.8.0  magrittr_2.0.3 
##  [9] evaluate_0.15   stringi_1.7.8   rlang_1.0.3     cli_3.3.0      
## [13] rstudioapi_0.13 jquerylib_0.1.4 Matrix_1.4-1    bslib_0.3.1    
## [17] rmarkdown_2.14  tools_4.2.1     stringr_1.4.0   xfun_0.31      
## [21] yaml_2.3.5      fastmap_1.1.0   compiler_4.2.1  htmltools_0.5.2
## [25] knitr_1.39      sass_0.4.1
## Python Configuration
py_config()
## python:         C:/Softwares/Python/Python310/python.exe
## libpython:      C:/Softwares/Python/Python310/python310.dll
## pythonhome:     C:/Softwares/Python/Python310
## version:        3.10.5 (tags/v3.10.5:f377153, Jun  6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)]
## Architecture:   64bit
## numpy:          C:/Softwares/Python/Python310/Lib/site-packages/numpy
## numpy_version:  1.23.1
## 
## NOTE: Python version was forced by RETICULATE_PYTHON

@JonathanDeacon

We're having similar issues with Pandas dataframe to R conversions producing NaNs in character columns rather than the expected NA. These NaNs then incorrectly resolve as FALSE in admiral functions that utilise is.na(x):

t1 <- c(NA, NaN, 'A', '', NULL)
t2 <- c(NA, NaN, 3, NULL)
t1d <- as.data.frame(t1)
t2d <- as.data.frame(t2)
is.na(t1)
[1]  TRUE FALSE FALSE FALSE
is.na(t2)
[1]  TRUE  TRUE FALSE
is.na(t1d)
        t1
[1,]  TRUE
[2,] FALSE
[3,] FALSE
[4,] FALSE
is.na(t2d)
        t2
[1,]  TRUE
[2,]  TRUE
[3,] FALSE

Ideally, in our use case, missing representations like NaN or None in pandas object columns would automatically convert to NA.
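For comparison, on the Python side pandas itself treats all of these markers as missing, which may be part of why a single NA is the expected result after conversion. A small sketch using `pd.isna` (the dictionary labels are only for display):

```python
import numpy as np
import pandas as pd

# pd.isna flags None, np.nan and pd.NA alike; an empty string is not missing.
candidates = {"None": None, "np.nan": np.nan, "pd.NA": pd.NA, "empty string": ""}
flags = {label: bool(pd.isna(value)) for label, value in candidates.items()}

print(flags)
```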

@PointyShinyBurning

My current workaround is to call df.fillna(pandas.NA) in the Python code; you can then run df_fixed <- df_broken %>% replace(is.na(.), NA) %>% unnest(where(is.list)) in R.

The intent of pandas.NA seems identical to R's NA to me, but the above is necessary because reticulate (1.26) interprets it as NA_real_ instead.

@ArthurAndrews

I have experienced these same issues in my own work.

I made an example to illustrate the behavior of converting pandas dataframes to R. I hope it's useful! :)

Summary of my observations:

  • reticulate can convert missing values in numerical pandas columns, but they appear as NaN in R instead of NA
  • there is more complex and problematic behavior for character pandas columns
  • missing values in pandas character columns may appear as np.nan or None depending on the preference of the Python programmer
  • pandas character columns may exist as numpy character arrays or pandas Series depending on the preference of the Python programmer
  • the resulting R column may be a list or a character vector depending on whether the Python character column is a numpy array or a pandas Series
  • the np.nans or Nones in numpy arrays are converted to "nan" or "None" in the R character vector
library(reticulate)
library(tibble)
  
py_run_string(
  paste(
    "import pandas as pd",
    "import numpy as np",
    "df = pd.DataFrame()",
    "df['numeric'] = pd.Series([1, 2, 3])",
    "df['numeric_na'] = pd.Series([1, 2, np.nan])",
    "df['character_pd_none'] = pd.Series(['a', 'b', None], dtype = 'str')",
    "df['character_np_none'] = np.array(['a', 'b', None], dtype = 'str')",
    "df['character_pd_nan'] = pd.Series(['a', 'b', np.nan], dtype = 'str')",
    "df['character_np_nan'] = np.array(['a', 'b', np.nan], dtype = 'str')",
    sep = "\n"
  )
) 

glimpse(py$df)
#> Rows: 3
#> Columns: 6
#> $ numeric           <dbl> 1, 2, 3
#> $ numeric_na        <dbl> 1, 2, NaN
#> $ character_pd_none <list> "a", "b", <NULL>
#> $ character_np_none <chr> "a", "b", "None"
#> $ character_pd_nan  <list> "a", "b", NaN
#> $ character_np_nan  <chr> "a", "b", "nan"

Created on 2023-02-02 with reprex v2.0.2
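The "None" and "nan" strings in the two numpy columns are likely produced before reticulate is involved at all: building a numpy array with dtype = 'str' stringifies every element, so the missing-value marker is already lost on the Python side. A short sketch (the variable names are illustrative):

```python
import numpy as np

# With dtype="str", numpy converts each element to its string form,
# so None and np.nan become ordinary strings.
with_none = np.array(["a", "b", None], dtype="str")
with_nan = np.array(["a", "b", np.nan], dtype="str")

print(with_none.tolist())
print(with_nan.tolist())
```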

@ArthurAndrews

Any thoughts on this? Reticulate is an amazing utility, but I experience this issue every time I use it: Character vectors in pandas data frames with missing values often become R lists with NULLs and overall, their conversion is unreliable.

As a user, the challenge I experience is this: I use reticulate to run Python code that returns a pandas data frame. I put a reticulate wrapper over top. Reticulate returns an R data frame with a list column and I don't know what to do with it. Does this list column have a nested structure that should be respected? Or is it just a wonky conversion of a pandas character vector?

My vote is that pandas character vectors of all flavors (pd.Series, np.arrays, containing Nones, containing nans, etc.) should be converted to R character vectors with NAs. Never a list, and never a vector with NaN.

Greatly appreciate all the hard work on this amazing package. I don't know what I would do without it!

@dfalbel
Member

dfalbel commented Oct 4, 2023

Hi @ArthurAndrews,

We have made some improvements, with an option described in #1439.
I think you need the dev version of reticulate, though.
