I'm using rpy2 version 2.8.4 in conjunction with R 3.3.0 and Python 2.7.10 to create an R dataframe:
import rpy2.robjects as ro
from rpy2.robjects import r
from rpy2.robjects import pandas2ri
df = ro.DataFrame({'Col1': ro.vectors.IntVector([1, 2, 3, 4, 5]),
'Col2': ro.vectors.StrVector(['a', 'b', 'c', 'd', 'e']),
'Col3': ro.vectors.FactorVector([1, 2, 3, ro.NA_Integer, ro.NA_Integer])})
print df
| Col2 | Col3 | Col1 |
----------------------
1 | a | 1 | 1 |
2 | b | 2 | 2 |
3 | c | 3 | 3 |
4 | d | NA | 4 |
5 | e | NA | 5 |
and I can convert this to a pandas dataframe without any trouble.
pandas2ri.ri2py(df)
| Col2 | Col3 | Col1 |
----------------------
1 | a | 1 | 1 |
2 | b | 2 | 2 |
3 | c | 3 | 3 |
4 | d | NA | 4 |
5 | e | NA | 5 |
However, I notice that the FactorVector metadata includes 'NA' as a factor level,
print r('levels(df$Col3)')
[1] "1" "2" "3" "NA"
which I understand is not default behaviour when creating factors in R.
If I drop 'NA' from the factor levels,
r.assign('df', df)
r('df$Col3 <- factor(as.numeric(levels(df$Col3))[df$Col3])')
then I get a very different result when converting the R dataframe to a pandas dataframe.
df2 = r['df']
pandas2ri.ri2py(df2)
| Col2 | Col3 | Col1 |
----------------------
1 | a | 1 | 1 |
2 | b | 2 | 2 |
3 | c | 3 | 3 |
4 | d | 1 | 4 |
5 | e | 1 | 5 |
My question is whether this is a bug, or am I doing something wrong by assuming that NA_Integer values should not be included as factor levels within R dataframes?
The conversion of a column of factors in an R data.frame to a column in a pandas DataFrame is happening with that code. Nothing in it handles NAs in a specific way, so this must happen upstream of the conversion. If you look at your column "Col3", you'll see that NAs are already listed as levels in the factor. This is even upstream of the creation of the R data.frame: the FactorVector itself already has NA among its levels.
What is happening is that the constructor for FactorVector in rpy2 is using a different default for the parameter exclude than what is in R's factor() function (I think that it was made so to make the mapping between the integers work as an index into the vector of levels by default). R's default behaviour can be restored by passing exclude explicitly when constructing the FactorVector.
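To make the effect of exclude concrete, here is a minimal plain-Python sketch of how a factor's levels and integer codes could be built. It is only an emulation for illustration: make_factor and the NA stand-in are hypothetical names, not part of rpy2 or R, and the real implementations differ in detail.

```python
NA = None  # hypothetical stand-in for R's NA in this plain-Python sketch


def make_factor(values, exclude=(NA,)):
    """Return (codes, levels) for `values`, loosely mimicking R's factor().

    Codes are 1-based positions in `levels`. A value that is not among
    the levels gets NA as its code.
    """
    # Levels are the sorted unique non-NA values that are not excluded.
    kept = sorted(set(v for v in values if v is not NA and v not in exclude))
    levels = [str(v) for v in kept]
    # If NA is not excluded, it becomes a level of its own, listed last.
    if NA in values and NA not in exclude:
        levels.append("NA")
    codes = []
    for v in values:
        label = "NA" if v is NA else str(v)
        codes.append(levels.index(label) + 1 if label in levels else NA)
    return codes, levels


data = [1, 2, 3, NA, NA]
# NA excluded from the levels (the default assumed in this sketch)
codes_excl, levels_excl = make_factor(data)
# Nothing excluded: NA is kept as a fourth level, as in the question
codes_keep, levels_keep = make_factor(data, exclude=())
```

With nothing excluded, every element has a valid code and the levels come out as "1", "2", "3", "NA", which matches what the question observed. With NA excluded, the last two elements have no level at all and their codes are themselves missing.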
The issue here is that there are no guidelines for the representation of missing values (in the sense of an IEEE standard). R uses an arbitrary extreme value to mark them, but Python does not have a notion of missing values.
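As a rough illustration of what that means for factor codes, the sketch below decodes 1-based codes into labels while checking for R's missing-integer marker, which is the smallest 32-bit signed integer. The names NA_INTEGER and decode_factor are hypothetical and this is not how rpy2 performs the conversion; it only shows why a converter has to special-case the marker.

```python
NA_INTEGER = -2**31  # the value R uses to mark a missing integer


def decode_factor(codes, levels):
    """Map 1-based factor codes to their labels.

    The NA marker is turned into None instead of being used as an index.
    """
    labels = []
    for code in codes:
        if code == NA_INTEGER:
            labels.append(None)  # missing value, no level to look up
        else:
            labels.append(levels[code - 1])
    return labels


decoded = decode_factor([1, 2, 3, NA_INTEGER, NA_INTEGER], ["1", "2", "3"])
```

Without the explicit check, the marker would be treated as an ordinary (and meaningless) index into the levels, since nothing on the Python side knows it stands for "missing".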