When using rpy2 with a built-in dataset from the synthpop R package (SD2011), I get this error:
robjects.r('head(SD2011)')
# ...
# ValueError: codes need to be between -1 and len(categories)-1
I narrowed the problem down to a column that has null entries. For example, I get the same error when doing this, but not with adjacent rows or columns:
robjects.r('SD2011[3, 27]')
I confirmed this is a null value with:
robjects.r('is.na(SD2011[, 27])')
# array([0, 0, 1, ..., 0, 0, 0], dtype=int32)
Why is rpy2 not handling this gracefully?
Here's my notebook running through it.
This seems to be a bug triggered during the conversion of the R factor to pandas with rpy2 versions 2.9.x (the dev branch default, future 3.0.x, does not have this issue). Specifically, the problem is in how the factor's integer codes are converted.
R "factor" objects are vectors of integers, with each integer an index into an associated vector of "levels". The converter simply subtracts one because R arrays are one-indexed and Python arrays are zero-indexed, but this breaks whenever there are missing values (NAs), because R uses a specific integer (an extreme value) to encode missing integers, and Python, numpy, and pandas do not have an equivalent for this.
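To make the failure mode concrete, here is a minimal pure-Python sketch of the off-by-one conversion described above. It is not rpy2's actual converter code: the function names, the example levels, and the example codes are all hypothetical. It relies on two facts: R encodes NA_integer_ as the smallest 32-bit signed integer, and pandas reserves the code -1 for a missing category (which is why the error message mentions -1).

```python
# R encodes NA_integer_ as the smallest 32-bit signed integer.
R_NA_INTEGER = -2**31


def naive_codes(r_codes):
    """Shift one-based R factor codes to zero-based codes (hypothetical helper).

    Missing values are shifted too, which produces an out-of-range code.
    """
    return [code - 1 for code in r_codes]


def na_aware_codes(r_codes):
    """Same shift, but map R's NA to -1, the code pandas uses for 'missing'."""
    return [-1 if code == R_NA_INTEGER else code - 1 for code in r_codes]


def codes_are_valid(codes, n_categories):
    """The range check named in the error: -1 <= code <= len(categories) - 1."""
    return all(-1 <= code <= n_categories - 1 for code in codes)


# Hypothetical factor with two levels and one missing entry.
levels = ["no", "yes"]
r_codes = [1, 2, R_NA_INTEGER, 1]

print(codes_are_valid(naive_codes(r_codes), len(levels)))     # False
print(codes_are_valid(na_aware_codes(r_codes), len(levels)))  # True
```

The naive shift turns the NA marker into a large negative number that fails the range check, while the NA-aware version yields codes a categorical can accept.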
I opened an issue to track this. In the meantime, workarounds are to replace the NAs on the R side with a level (called, say, "missing" or "NA"), to change the factors to arrays of strings, or to modify the pandas converter for R factors. For example:
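The following is only a sketch of the second workaround (factors to strings), done on the R side before anything reaches the pandas converter. It assumes SD2011 is already available in the R session, as in the question. The names df, R_FACTORS_TO_STRINGS and head_without_factors are made up for illustration, and I have not checked this against every 2.9.x release.

```python
# R code (hypothetical variable name `df`) that copies the data frame and
# turns every factor column into a character column, so no factor codes
# have to be converted on the Python side.
R_FACTORS_TO_STRINGS = """
df <- SD2011
df[] <- lapply(df, function(col) if (is.factor(col)) as.character(col) else col)
head(df)
"""


def head_without_factors():
    """Evaluate the R code above and return the converted result.

    Needs rpy2, R, and the synthpop dataset loaded in the R session.
    """
    import rpy2.robjects as robjects  # imported lazily; only needed when called

    return robjects.r(R_FACTORS_TO_STRINGS)
```

Calling head_without_factors() in the same session as the question should return the first rows with string columns in place of the factors.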
(Or use rpy2's Pythonic interface to dplyr)
Note:
A few things happen in succession when doing robjects.r('SD2011[3, 27]'): the R code SD2011[3, 27] is evaluated, the result is converted to a Python object, and a text representation of that object is shown.
If unsure which step is at fault, finding which one of the Python statements below is the first to fail will tell you:
Evaluate the R code (the added TRUE is to prevent the evaluation from returning x).
Fetch the object x obtained from the evaluation above and bind it to a Python symbol (the conversion will be applied).
Show a text representation of the converted object.
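The three steps can be sketched roughly as below. The generic helper first_failing_step is hypothetical and independent of rpy2. The rpy2 statements inside rpy2_steps are my reconstruction of the steps described, assuming SD2011 is already available in the R session as in the question.

```python
def first_failing_step(steps):
    """Run (label, callable) pairs in order.

    Returns (label, exception) for the first step that raises,
    or None if every step succeeds.
    """
    for label, step in steps:
        try:
            step()
        except Exception as exc:
            return label, exc
    return None


def rpy2_steps():
    """Build the three steps discussed above (needs rpy2 and R)."""
    import rpy2.robjects as robjects  # imported lazily; only needed when called

    state = {}

    def evaluate():
        # The trailing TRUE keeps the evaluation from returning x itself.
        robjects.r('x <- SD2011[3, 27]; TRUE')

    def fetch():
        # Fetching from the R global environment applies the conversion.
        state['x'] = robjects.globalenv['x']

    def show():
        print(state['x'])

    return [('evaluate', evaluate), ('fetch and convert', fetch), ('show', show)]


# Usage, in a session where rpy2 and the dataset are available:
# print(first_failing_step(rpy2_steps()))
```

If the conversion is the culprit, the reported label would be the second one, since evaluation alone does not convert x.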