Rpy2 pandas2ri.ri2py() is converting NA values to

I'm using rpy2 version 2.8.4 with R 3.3.0 and Python 2.7.10 to create an R dataframe:

import rpy2.robjects as ro
from rpy2.robjects import r
from rpy2.robjects import pandas2ri

df = ro.DataFrame({'Col1': ro.vectors.IntVector([1, 2, 3, 4, 5]),
               'Col2': ro.vectors.StrVector(['a', 'b', 'c', 'd', 'e']),
               'Col3': ro.vectors.FactorVector([1, 2, 3, ro.NA_Integer, ro.NA_Integer])})
print df

   | Col2 | Col3 | Col1 |
   ----------------------
 1 |  a   | 1    | 1    |
 2 |  b   | 2    | 2    |
 3 |  c   | 3    | 3    |
 4 |  d   | NA   | 4    |
 5 |  e   | NA   | 5    |

and I can convert this to a pandas dataframe without any trouble.

pandas2ri.ri2py(df)

   | Col2 | Col3 | Col1 |
   ----------------------
 1 |  a   | 1    | 1    |
 2 |  b   | 2    | 2    |
 3 |  c   | 3    | 3    |
 4 |  d   | NA   | 4    |
 5 |  e   | NA   | 5    |

However, I notice that the FactorVector metadata includes 'NA' as a factor level,

print r('levels(df$Col3)')

[1] "1"  "2"  "3"  "NA"

which I understand is not default behaviour when creating factors in R.

If I drop 'NA' from the factor levels,

r.assign('df', df)
r('df$Col3 <- factor(as.numeric(levels(df$Col3))[df$Col3])')

then I get a very different result when converting the R dataframe to a pandas dataframe.

df2 = r['df']
pandas2ri.ri2py(df2)

   | Col2 | Col3 | Col1 |
   ----------------------
 1 |  a   | 1    | 1    |
 2 |  b   | 2    | 2    |
 3 |  c   | 3    | 3    |
 4 |  d   | 1    | 4    |
 5 |  e   | 1    | 5    |

My question is whether this is a bug, or whether I am wrong to assume that NA_Integer values should not be included as factor levels in R dataframes.

Tags: r python-2.7 rpy2

1 answer

神经病院院长

#2 · 2019-07-22 06:14

The conversion of a factor column in an R data.frame to a column in a pandas DataFrame is done by rpy2's pandas2ri conversion code. Nothing there handles NAs in a specific way, so this must happen upstream of the conversion. If you look at your column "Col3", you'll see that NA is already listed as a level of the factor.

>>> print(df.rx2("Col3"))
[1] 1  2  3  NA NA
Levels: 1 2 3 NA

This is even upstream of the creation of the R data.frame:

>>> lst = [1, 2, 3, ro.NA_Integer, ro.NA_Integer]
>>> print(ro.vectors.FactorVector(lst))
[1] 1  2  3  NA NA
Levels: 1 2 3 NA

What is happening is that the constructor for FactorVector in rpy2 uses a different default for the parameter exclude than R's factor() function does. (I think this was done so that the integer codes work as indices into the vector of levels by default.)

R's default behaviour can be restored with:

>>> v = ro.vectors.FactorVector(lst, exclude=ro.StrVector(["NA"]))
>>> print(v)
[1] 1    2    3    <NA> <NA>
Levels: 1 2 3
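
To make the effect of exclude concrete, here is a minimal pure-Python sketch of how a factor pairs 1-based integer codes with a vector of levels. It is only a model of the behaviour described above, not rpy2's or R's implementation: make_factor and the NA stand-in are hypothetical names introduced for illustration.

```python
# Hypothetical pure-Python model of a factor; not rpy2's or R's actual code.
NA = None  # stand-in for R's NA in this sketch


def make_factor(values, exclude=(NA,)):
    """Return (codes, levels), with 1-based codes as in an R factor.

    Values listed in `exclude` get no level and a missing code (None).
    """
    kept = {v for v in values if v not in exclude}
    # Real values are sorted; NA goes last if it was kept as a level.
    levels = sorted(v for v in kept if v is not NA)
    if NA in kept:
        levels.append(NA)
    index = {level: i + 1 for i, level in enumerate(levels)}
    codes = [index.get(v) for v in values]
    return codes, levels


values = [1, 2, 3, NA, NA]
print(make_factor(values))              # ([1, 2, 3, None, None], [1, 2, 3])
print(make_factor(values, exclude=()))  # ([1, 2, 3, 4, 4], [1, 2, 3, None])
```

With nothing excluded, every element has a valid code that indexes into the levels, which matches the rationale guessed at above. With NA excluded, the missing entries have no code at all, and whatever converts the factor has to handle them explicitly.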

The underlying issue is that there is no standard for representing missing values (in the sense of an IEEE standard). R uses an arbitrary extreme value, but Python has no notion of missing values.
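
As a rough illustration of that "extreme value": R encodes an integer NA as the smallest 32-bit signed integer, so code that reads the raw integers without checking for the sentinel sees an ordinary number. The helper below is a hypothetical sketch of explicit decoding, not part of rpy2.

```python
# R stores NA_integer_ as the smallest 32-bit signed integer.
R_NA_INTEGER = -2**31


def decode_r_integers(raw):
    """Replace R's integer NA sentinel with None; leave other values alone."""
    return [None if x == R_NA_INTEGER else x for x in raw]


raw = [1, 2, 3, R_NA_INTEGER, R_NA_INTEGER]
print(decode_r_integers(raw))  # [1, 2, 3, None, None]
```

None is used here only as a convenient marker; a plain Python int has no built-in missing value, which is the point made above.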
