I'm using rpy2 2.8.4 with R 3.3.0 and Python 2.7.10 to create an R data frame:
import rpy2.robjects as ro
from rpy2.robjects import r
from rpy2.robjects import pandas2ri
df = ro.DataFrame({'Col1': ro.vectors.IntVector([1, 2, 3, 4, 5]),
                   'Col2': ro.vectors.StrVector(['a', 'b', 'c', 'd', 'e']),
                   'Col3': ro.vectors.FactorVector([1, 2, 3, ro.NA_Integer, ro.NA_Integer])})
print df
| Col2 | Col3 | Col1 |
----------------------
1 | a | 1 | 1 |
2 | b | 2 | 2 |
3 | c | 3 | 3 |
4 | d | NA | 4 |
5 | e | NA | 5 |
I can convert this to a pandas DataFrame without any trouble:
pandas2ri.ri2py(df)
| Col2 | Col3 | Col1 |
----------------------
1 | a | 1 | 1 |
2 | b | 2 | 2 |
3 | c | 3 | 3 |
4 | d | NA | 4 |
5 | e | NA | 5 |
However, I notice that the FactorVector metadata includes 'NA' as a factor level,
print r('levels(df$Col3)')
[1] "1" "2" "3" "NA"
which I understand is not default behaviour when creating factors in R.
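To make the distinction concrete, here is a minimal pure-Python sketch (not rpy2 code) of the two ways a missing value can sit in a factor: as an extra level, or as a missing code with no level. It assumes the usual R storage model of 1-based integer codes plus a levels vector; the names `NA_INTEGER` and `decode_factor`, and the codes used in case A, are assumptions for illustration only.

```python
# Pure-Python model of an R factor: 1-based integer codes plus a list of levels.
# NA_INTEGER is a hypothetical stand-in for R's NA_integer_ sentinel
# (the smallest 32-bit integer in R's C implementation).
NA_INTEGER = -2**31


def decode_factor(codes, levels):
    """Map 1-based factor codes to level labels; a missing code becomes None."""
    return [None if c == NA_INTEGER else levels[c - 1] for c in codes]


# Case A: "NA" is stored as an ordinary level, as levels(df$Col3) reports above.
# The codes [1, 2, 3, 4, 4] are assumed, not read from rpy2.
with_na_level = decode_factor([1, 2, 3, 4, 4], ["1", "2", "3", "NA"])

# Case B: R's default, where NA is a missing code and not a level.
without_na_level = decode_factor([1, 2, 3, NA_INTEGER, NA_INTEGER],
                                 ["1", "2", "3"])

print(with_na_level)     # ['1', '2', '3', 'NA', 'NA']
print(without_na_level)  # ['1', '2', '3', None, None]
```

In case A the "NA" entries are just strings, so nothing is actually missing; in case B a converter has to recognise the sentinel code before indexing into the levels.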
If I drop 'NA' from the factor levels,
r.assign('df', df)
r('df$Col3 <- factor(as.numeric(levels(df$Col3))[df$Col3])')
then converting the R dataframe to a pandas dataframe gives a very different result: the two NA entries in Col3 come back as 1.
df2 = r['df']
pandas2ri.ri2py(df2)
| Col2 | Col3 | Col1 |
----------------------
1 | a | 1 | 1 |
2 | b | 2 | 2 |
3 | c | 3 | 3 |
4 | d | 1 | 4 |
5 | e | 1 | 5 |
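For comparison, here is a rough pure-Python model of what the R expression `factor(as.numeric(levels(x))[x])` does to the codes and levels, and of what I would expect a conversion to produce afterwards. It is only a sketch: `NA_INTEGER`, `to_number` and `refactor` are hypothetical names, the input codes are assumed, and none of this is how rpy2 or pandas2ri is actually implemented.

```python
NA_INTEGER = -2**31  # hypothetical stand-in for R's NA_integer_


def to_number(label):
    """Like as.numeric() on one string: unparseable text becomes None (NA)."""
    try:
        return float(label)
    except ValueError:
        return None


def refactor(codes, levels):
    """Model of factor(as.numeric(levels(x))[x]) for 1-based codes."""
    numeric = [to_number(label) for label in levels]        # as.numeric(levels(x))
    values = [numeric[c - 1] for c in codes]                # ...[x]
    uniq = sorted(set(v for v in values if v is not None))  # factor() leaves NA out of the levels
    new_codes = [NA_INTEGER if v is None else uniq.index(v) + 1 for v in values]
    new_levels = ["%g" % v for v in uniq]                   # levels are stored as strings
    return new_codes, new_levels


# Assumed starting point: "NA" is level 4 and rows 4 and 5 point at it.
new_codes, new_levels = refactor([1, 2, 3, 4, 4], ["1", "2", "3", "NA"])

# Decoding with an explicit check for the missing-value sentinel.
decoded = [None if c == NA_INTEGER else new_levels[c - 1] for c in new_codes]

print(new_levels)  # ['1', '2', '3']
print(decoded)     # ['1', '2', '3', None, None]
```

Under this model rows 4 and 5 decode as missing, not as the first level, which is the result I expected from the conversion above.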
My question: is this a bug, or am I wrong to assume that NA_Integer values should not be included as factor levels in R dataframes?