
Rpy2 pandas2ri.ri2py() is converting NA values to

Posted 2019-07-22 06:09

Question:

I'm using Rpy2 version 2.8.4 in conjunction with R 3.3.0 and Python 2.7.10 to create an R dataframe:

import rpy2.robjects as ro
from rpy2.robjects import r
from rpy2.robjects import pandas2ri

df = ro.DataFrame({'Col1': ro.vectors.IntVector([1, 2, 3, 4, 5]),
               'Col2': ro.vectors.StrVector(['a', 'b', 'c', 'd', 'e']),
               'Col3': ro.vectors.FactorVector([1, 2, 3, ro.NA_Integer, ro.NA_Integer])})
print df

   | Col2 | Col3 | Col1 |
   ----------------------
 1 |  a   | 1    | 1    |
 2 |  b   | 2    | 2    |
 3 |  c   | 3    | 3    |
 4 |  d   | NA   | 4    |
 5 |  e   | NA   | 5    |

and I can convert this to a pandas dataframe without any trouble.

pandas2ri.ri2py(df)

   | Col2 | Col3 | Col1 |
   ----------------------
 1 |  a   | 1    | 1    |
 2 |  b   | 2    | 2    |
 3 |  c   | 3    | 3    |
 4 |  d   | NA   | 4    |
 5 |  e   | NA   | 5    |

However, I notice that the FactorVector metadata includes 'NA' as a factor level,

print r('levels(df$Col3)')

[1] "1"  "2"  "3"  "NA"

which I understand is not default behaviour when creating factors in R.

If I drop 'NA' from the factor levels,

r.assign('df', df)
r('df$Col3 <- factor(as.numeric(levels(df$Col3))[df$Col3])')

then I get a very different result when converting the R dataframe to a pandas dataframe.

df2 = r['df']
pandas2ri.ri2py(df2)

   | Col2 | Col3 | Col1 |
   ----------------------
 1 |  a   | 1    | 1    |
 2 |  b   | 2    | 2    |
 3 |  c   | 3    | 3    |
 4 |  d   | 1    | 4    |
 5 |  e   | 1    | 5    |

My question is whether this is a bug, or am I doing something wrong by assuming that NA_Integer values should not be included as factor levels within R dataframes?

Answer 1:

The code that converts a column of factors in an R data.frame to a column in a pandas DataFrame does nothing to handle NAs in a specific way, so this must happen upstream of the conversion. If you look at your column "Col3" you'll see that NA is already listed as a level in the factor.

>>> print(df.rx2("Col3"))
[1] 1  2  3  NA NA
Levels: 1 2 3 NA

This is even upstream of the creation of the R data.frame:

>>> lst = [1, 2, 3, ro.NA_Integer, ro.NA_Integer]
>>> print(ro.vectors.FactorVector(lst))
[1] 1  2  3  NA NA
Levels: 1 2 3 NA

What is happening is that the constructor for FactorVector in rpy2 uses a different default for the parameter exclude than R's factor() function does. (I think this was done so that the integer codes work as indices into the vector of levels by default.)

R's default behaviour can be restored with:

>>> v = ro.vectors.FactorVector(lst, exclude=ro.StrVector(["NA"]))
>>> print(v)
[1] 1    2    3    <NA> <NA>
Levels: 1 2 3
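
To make the role of exclude more concrete, here is a toy model in plain Python. It is only a sketch and not rpy2 or R code: make_factor and its exclude_na flag are hypothetical names, and None stands in for NA. It shows that keeping "NA" as a level gives every element a valid 1-based code, while excluding it leaves the code itself missing.

```python
def make_factor(values, exclude_na=True):
    """Toy model of building a factor: return (codes, levels).

    values     -- list of ints, with None standing in for NA
    exclude_na -- if True, "NA" is never added to the levels
    Codes are 1-based, as in R; None marks a missing code.
    """
    # Levels are the sorted distinct non-missing values, as strings.
    present = sorted({v for v in values if v is not None})
    levels = [str(v) for v in present]
    # Optionally keep "NA" as an ordinary level at the end.
    if not exclude_na and any(v is None for v in values):
        levels.append("NA")
    codes = []
    for v in values:
        label = "NA" if v is None else str(v)
        codes.append(levels.index(label) + 1 if label in levels else None)
    return codes, levels


values = [1, 2, 3, None, None]
kept = make_factor(values, exclude_na=False)
dropped = make_factor(values, exclude_na=True)
print(kept)     # ([1, 2, 3, 4, 4], ['1', '2', '3', 'NA'])
print(dropped)  # ([1, 2, 3, None, None], ['1', '2', '3'])
```

In the first case every code indexes into the levels, which is the convenience mentioned above. In the second case whatever consumes the codes has to know how to treat a missing one.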

The issue here is that there are no guidelines for the representation of missing values (in the sense of an IEEE standard). R uses an arbitrary extreme value, but Python does not have the notion of missing values.
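
As an illustration of that extreme value: for integers, R stores NA as the smallest 32-bit signed integer. The sketch below is plain Python, not rpy2's conversion code; R_NA_INTEGER and decode_factor are hypothetical names, and using None for a missing label is just one possible choice. It shows that a consumer of factor codes has to check for the sentinel explicitly, because to Python it is only another integer.

```python
# R represents NA_integer_ as the minimum 32-bit signed integer.
R_NA_INTEGER = -2**31


def decode_factor(codes, levels):
    """Map 1-based factor codes to their labels.

    The NA sentinel is turned into None instead of being used as an index.
    """
    labels = []
    for code in codes:
        if code == R_NA_INTEGER:
            labels.append(None)            # missing value, no label
        else:
            labels.append(levels[code - 1])  # R codes start at 1
    return labels


codes = [1, 2, 3, R_NA_INTEGER, R_NA_INTEGER]
levels = ["1", "2", "3"]
labels = decode_factor(codes, levels)
print(labels)  # ['1', '2', '3', None, None]
```

Without the explicit check, the sentinel would be treated as an ordinary (and nonsensical) index into the levels.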