pandas DataFrame with np.nan are not converted correctly #207

lepennec · 2018-03-31T14:39:52Z

Dear all,
I've stumbled on an issue with the way NAs are encoded in Python: it seems that the classical way is to use np.nan whatever the data type, as described in https://pandas.pydata.org/pandas-docs/stable/missing_data.html. This is the convention used by pd.read_csv, for instance. When the data type is a "string", this leads to an issue: reticulate fails to convert such a DataFrame, as shown in the following example:

library(reticulate)

py_run_string('import pandas as pd')
py_run_string('import numpy as np')
py_run_string('date = ["2017/01/12", np.nan]')
py_run_string('data = {"date" : date}')
py_run_string('df = pd.DataFrame(data)')
py_run_string('dfcol = df["date"]')

py$df
py$dfcol

It seems that the column is converted to a list rather than to a vector with missing data, which seems to be the expected behavior from a Python user's perspective.
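For readers who want to see what the column holds on the Python side before reticulate touches it, here is a minimal sketch. It assumes the column is stored with object dtype, which is stated explicitly here because newer pandas releases may infer a dedicated string dtype instead. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Same values as in the report above, stored explicitly as Python objects.
col = pd.Series(["2017/01/12", np.nan], dtype=object)

# The column mixes two Python types: a str and a float (np.nan is a float).
element_types = [type(v).__name__ for v in col]

print(col.dtype)            # object
print(element_types)        # ['str', 'float']
print(col.isna().tolist())  # [False, True]
```

pandas itself treats the second element as missing (`isna()` is True), but the stored value is still a float NaN sitting next to a string, which matches the `chr` / `num NaN` elements reported later in this thread.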

Yours,

Erwan

jjallaire · 2018-04-01T11:28:41Z

@kevinushey Could you take a look at this one?

kevinushey · 2018-04-02T17:40:37Z

Perhaps rather than converting np.nan as NaN (which only exists in R for numeric types) we should use NA (which R defines for all vector types)? Or perhaps in this specific case we need to use NA instead of NaN.

That said, I would highly recommend using NumPy datetime64 objects rather than Python datetime objects within Pandas DataFrames.

kevinushey · 2018-04-02T17:50:49Z

Now fixed up on master, but note that deserializing this object gives you list-columns, e.g.

> py$df
        date
1 2017/01/12
2        NaN
> str(py$df)
'data.frame':	2 obs. of  1 variable:
 $ date:List of 2
  ..$ : chr "2017/01/12"
  ..$ : num NaN

We could consider looping and simplifying character vectors generated in this way, though. @jjallaire what do you think?

andyfarmerboy · 2018-04-07T17:29:35Z

Note: a related conversion failure occurs where missing values in Pandas DataFrames are coded as None. Ideally these would be converted to NA in R, at least in character columns. The only way around the issue I've found so far is to use the .fillna method and insert an arbitrary character string as a placeholder.
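The placeholder workaround described above could look roughly like this on the Python side. This is only a sketch: the placeholder string `__missing__` and the column name `y` are hypothetical, and the placeholder must be a value that cannot occur in the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical placeholder; pick a string that never appears in the data.
PLACEHOLDER = "__missing__"

# An object column that mixes both missing markers discussed in this thread.
df = pd.DataFrame({"y": pd.Series(["a", None, np.nan], dtype=object)})

# fillna treats both None and np.nan as missing and replaces them.
filled = df.fillna(PLACEHOLDER)

print(filled["y"].tolist())
```

After the conversion, the placeholder still has to be turned back into NA on the R side, so this trades the list-column problem for one extra clean-up step.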

terrytangyuan · 2018-04-07T21:04:37Z

Related issue but for NAs in general: #197

alexenge · 2022-01-17T15:29:02Z

Has there been any update on this? It would be super useful if NaNs inside a string column of a pandas DataFrame were automatically converted to R's NA, so that R could treat the column as a character vector instead of a list.

sushantd195 · 2022-06-07T17:10:56Z

Was this issue resolved in version 1.24? I see the milestone for it, but the 1.24 release notes don't mention anything.

shivam7898 · 2022-07-14T18:44:06Z

I am using version 1.25 and the issue still seems to be open. I agree with the sentiments shared by others. Pandas converts None in numeric type columns to np.nan (A), and in object type columns it keeps None (B) and np.nan (C) as they are. During conversion to an R data.frame, these should be treated as NA. My personal wishlist would be:

All 3 (A, B, C) to be converted to NA
Else (B & C) None & np.nan in 'object' type to be converted to NA
Else (B) None in 'object' type to be converted to NA
At least, column conversion to a list containing NULL or NaN should be avoided

# Python Chunk
import pandas as pd, numpy as np
pp_ok = pd.DataFrame({'x': (None, 1), 'y': ('a', 'bb')}) 
ss_none = pd.DataFrame({'x': (None, 1), 'y': ('a', None)}) 
rr_nan = pd.DataFrame({'x': (None, 1), 'y': ('a', np.nan)}) 
# 'object' type columns keep None as None (Numeric types treat it as np.nan)
print(ss_none)
##      x     y
## 0  NaN     a
## 1  1.0  None
print(rr_nan)
##      x    y
## 0  NaN    a
## 1  1.0  NaN

# R Chunk
# NULL should be avoided in DataFrame
str(py$pp_ok, give.attr = FALSE)        #np.nan (of numeric) to NaN in vector
## 'data.frame':    2 obs. of  2 variables:
##  $ x: num  NaN 1
##  $ y: chr  "a" "bb"
str(py$ss_none, give.attr = FALSE)      #None (of object) to NULL in list
## 'data.frame':    2 obs. of  2 variables:
##  $ x: num  NaN 1
##  $ y:List of 2
##   ..$ : chr "a"
##   ..$ : NULL
str(py$rr_nan, give.attr = FALSE)       #np.nan (of object) to NaN in list
## 'data.frame':    2 obs. of  2 variables:
##  $ x: num  NaN 1
##  $ y:List of 2
##   ..$ : chr "a"
##   ..$ : num NaN
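To check which of the cases (B) or (C) an object column falls into before handing it over to R, a small diagnostic along these lines may help. It is a sketch using only the standard library. The helper names `missing_kind` and `summarize_missing` are made up for illustration, and a column would be passed in as a plain list, for example via `df["y"].tolist()`.

```python
import math
from collections import Counter


def missing_kind(value):
    """Classify one cell as 'None', 'NaN', or 'value'."""
    if value is None:
        return "None"
    # np.nan is an ordinary Python float, so this check covers it too.
    if isinstance(value, float) and math.isnan(value):
        return "NaN"
    return "value"


def summarize_missing(values):
    """Count how many cells of each kind a column (given as a list) contains."""
    return dict(Counter(missing_kind(v) for v in values))


# Column 'y' of ss_none (case B) and of rr_nan (case C) from the example above.
summary_none = summarize_missing(["a", None])
summary_nan = summarize_missing(["a", float("nan")])

print(summary_none)
print(summary_nan)
```

A column that reports any 'None' or 'NaN' cells alongside string values is one that, per the output above, currently arrives in R as a list.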

sessionInfo() & py_config():

## R SessionInfo
sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_India.utf8  LC_CTYPE=English_India.utf8   
## [3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C                  
## [5] LC_TIME=English_India.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] reticulate_1.25
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.9      lattice_0.20-45 png_0.1-7       digest_0.6.29  
##  [5] grid_4.2.1      R6_2.5.1        jsonlite_1.8.0  magrittr_2.0.3 
##  [9] evaluate_0.15   stringi_1.7.8   rlang_1.0.3     cli_3.3.0      
## [13] rstudioapi_0.13 jquerylib_0.1.4 Matrix_1.4-1    bslib_0.3.1    
## [17] rmarkdown_2.14  tools_4.2.1     stringr_1.4.0   xfun_0.31      
## [21] yaml_2.3.5      fastmap_1.1.0   compiler_4.2.1  htmltools_0.5.2
## [25] knitr_1.39      sass_0.4.1

## Python Configuration
py_config()
## python:         C:/Softwares/Python/Python310/python.exe
## libpython:      C:/Softwares/Python/Python310/python310.dll
## pythonhome:     C:/Softwares/Python/Python310
## version:        3.10.5 (tags/v3.10.5:f377153, Jun  6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)]
## Architecture:   64bit
## numpy:          C:/Softwares/Python/Python310/Lib/site-packages/numpy
## numpy_version:  1.23.1
## 
## NOTE: Python version was forced by RETICULATE_PYTHON

JonathanDeacon · 2022-09-21T07:53:33Z

We're having similar issues with Pandas dataframe to R conversions producing NaNs in character columns rather than the expected NA. These NaNs then incorrectly resolve as FALSE in admiral functions that utilise is.na(x):

t1 <- c(NA, NaN, 'A', '', NULL)
t2 <- c(NA, NaN, 3, NULL)
t1d <- as.data.frame(t1)
t2d <- as.data.frame(t2)
is.na(t1)
[1]  TRUE FALSE FALSE FALSE
is.na(t2)
[1]  TRUE  TRUE FALSE
is.na(t1d)
        t1
[1,]  TRUE
[2,] FALSE
[3,] FALSE
[4,] FALSE
is.na(t2d)
        t2
[1,]  TRUE
[2,]  TRUE
[3,] FALSE

Ideally, in our use case, missing representations like NaN or None in Pandas object columns would automatically convert to NA.

PointyShinyBurning · 2022-10-18T10:18:51Z

My current workaround is to call df.fillna(pandas.NA) in the Python code; you can then run df_fixed <- df_broken %>% replace(is.na(.), NA) %>% unnest(where(is.list)) in R.

The intent of pandas.NA seems identical to R NA to me, but the above is necessary because reticulate (1.26) interprets them as NA_real_ instead.

ArthurAndrews · 2023-02-02T16:14:47Z

I have experienced these same issues in my own work.

I made an example to illustrate the behavior of converting pandas dataframes to R. I hope it's useful! :)

Summary of my observations:

reticulate can convert missing values in numerical pandas columns, but they will appear as NaN in R instead of NA
there is more complex and problematic behavior for character pandas columns
missing values in pandas character columns may appear as np.nan or None, depending on the preference of the Python programmer
pandas character columns may exist as numpy character arrays or pandas Series, depending on the preference of the Python programmer
the resulting R column may be a list or a character vector, depending on whether the Python character column is a numpy array or a pandas Series
the np.nans or Nones in numpy arrays are converted to "nan" or "None" in the R character vector

library(reticulate)
library(tibble)
  
py_run_string(
  paste(
    "import pandas as pd",
    "import numpy as np",
    "df = pd.DataFrame()",
    "df['numeric'] = pd.Series([1, 2, 3])",
    "df['numeric_na'] = pd.Series([1, 2, np.nan])",
    "df['character_pd_none'] = pd.Series(['a', 'b', None], dtype = 'str')",
    "df['character_np_none'] = np.array(['a', 'b', None], dtype = 'str')",
    "df['character_pd_nan'] = pd.Series(['a', 'b', np.nan], dtype = 'str')",
    "df['character_np_nan'] = np.array(['a', 'b', np.nan], dtype = 'str')",
    sep = "\n"
  )
) 

glimpse(py$df)
#> Rows: 3
#> Columns: 6
#> $ numeric           <dbl> 1, 2, 3
#> $ numeric_na        <dbl> 1, 2, NaN
#> $ character_pd_none <list> "a", "b", <NULL>
#> $ character_np_none <chr> "a", "b", "None"
#> $ character_pd_nan  <list> "a", "b", NaN
#> $ character_np_nan  <chr> "a", "b", "nan"

Created on 2023-02-02 with reprex v2.0.2
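The "None" and "nan" strings in the two numpy-backed columns can be reproduced on the Python side alone, which suggests they come from numpy's string coercion rather than from the conversion to R. A minimal sketch (variable names are illustrative):

```python
import numpy as np

# Forcing a string dtype makes numpy stringify every element,
# including the missing markers.
with_none = np.array(["a", "b", None], dtype="str")
with_nan = np.array(["a", "b", np.nan], dtype="str")

print(with_none.tolist())  # ['a', 'b', 'None']
print(with_nan.tolist())   # ['a', 'b', 'nan']
```

Once the array has been built this way, the information that a value was missing is already gone, so no conversion rule on the R side could recover an NA from it without guessing.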

ArthurAndrews · 2023-10-04T13:51:49Z

Any thoughts on this? Reticulate is an amazing utility, but I experience this issue every time I use it: Character vectors in pandas data frames with missing values often become R lists with NULLs and overall, their conversion is unreliable.

As a user, the challenge I experience is this: I use reticulate to run Python code that returns a pandas data frame. I put a reticulate wrapper over top. Reticulate returns an R data frame with a list column and I don't know what to do with it. Does this list column have a nested structure that should be respected? Or is it just a wonky conversion of a pandas character vector?

My vote is that pandas character vectors of all flavors (pd.Series, np.arrays, containing Nones, containing nans, etc.) should be converted to R character vectors with NAs. Never a list, and never a vector with NaN.

Greatly appreciate all the hard work on this amazing package. I don't know what I would do without it!

dfalbel · 2023-10-04T19:21:41Z

Hi @ArthurAndrews,

We have made some improvements, with an option described in #1439.
I think you need the dev version of reticulate though.

kevinushey self-assigned this Apr 2, 2018

kevinushey added a commit that referenced this issue Apr 2, 2018

handle conversion of list columns (#207)

5dee89f

kevinushey added this to the 1.24 milestone Jan 17, 2022

dfalbel mentioned this issue Aug 15, 2023

Preserve NAs when casting R data.frames to pandas. #1439

Merged
