When using rpy2 with a built-in dataset from the synthpop R package (SD2011), I get this error:
robjects.r('head(SD2011)')
# ...
# ValueError: codes need to be between -1 and len(categories)-1
I narrowed the problem down to a column that has null (NA) entries. For example, the following raises the same error, while adjacent rows and columns do not:
robjects.r('SD2011[3, 27]')
I confirmed this is a null value with:
robjects.r('is.na(SD2011[, 27])')
# array([0, 0, 1, ..., 0, 0, 0], dtype=int32)
Why is rpy2 not handling this gracefully?
Here's my notebook running through it.
Why is rpy2 not handling this gracefully?
This seems to be a bug triggered during the conversion of the R factor to pandas with rpy2 versions 2.9.x (the dev branch default, future 3.0.x, does not have this issue). Specifically, it happens when doing:
res = pandas.Categorical.from_codes(numpy.asarray(obj) - 1,
categories = obj.do_slot('levels'),
ordered = 'ordered' in obj.rclass)
R "factor" objects are vectors of integers, with each integer an index into an associated vector of "levels". The converter simply subtracts one because R arrays are one-indexed and Python arrays are zero-indexed, but this breaks whenever there are missing values (NAs): R encodes a missing integer with a specific extreme value (the smallest 32-bit integer), and Python, numpy, and pandas have no equivalent for it. After the subtraction, that value falls outside the range pandas accepts for category codes, hence the ValueError.
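The failure can be reproduced without R at all. The following is a minimal sketch, assuming factor codes arrive as plain integers with R's NA sentinel in place of missing entries; `factor_codes_to_categorical` and `R_NA_INTEGER` are hypothetical names for illustration, not part of rpy2. It maps the sentinel to -1, which is the code pandas uses for a missing category, before shifting the remaining codes to zero-based.

```python
import numpy as np
import pandas as pd

# R encodes NA_integer_ as the smallest 32-bit integer.
R_NA_INTEGER = np.iinfo(np.int32).min


def factor_codes_to_categorical(r_codes, levels, ordered=False):
    """Hypothetical helper: build a Categorical from 1-based R factor codes."""
    r_codes = np.asarray(r_codes, dtype=np.int64)
    # Map the NA sentinel to -1 (pandas' code for "missing");
    # shift every other code from 1-based to 0-based.
    codes = np.where(r_codes == R_NA_INTEGER, -1, r_codes - 1)
    return pd.Categorical.from_codes(codes, categories=levels, ordered=ordered)


levels = ["low", "mid", "high"]
r_codes = [1, 3, R_NA_INTEGER, 2]  # third entry is missing

# Naive conversion: subtracting one from the sentinel gives an invalid code.
try:
    pd.Categorical.from_codes(
        np.asarray(r_codes, dtype=np.int64) - 1, categories=levels
    )
    naive_failed = False
except ValueError:
    naive_failed = True

# NA-aware conversion: the missing entry becomes NaN in the Categorical.
fixed = factor_codes_to_categorical(r_codes, levels)
```

This is only meant to illustrate the mechanism; the actual fix in rpy2 may be implemented differently.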
I opened an issue to track this. In the meantime, possible workarounds are to replace the NAs on the R side with a level (called, say, "missing" or "NA"), to change the factors to arrays of strings, or to modify the pandas converter for R factors. For example:
robjects.r("""
library(dplyr)  # provides %>% and mutate_if
# Convert every factor column to a character column
SD2011_nofactor <- SD2011 %>%
  dplyr::mutate_if(is.factor, as.character)
""")
(Or use rpy2's Pythonic interface to dplyr)
Note:
A few things happen in succession when running:
robjects.r('SD2011[3, 27]')
- the R code SD2011[3, 27] is evaluated
- the result of that evaluation goes through the robjects-level conversion
- the object resulting from that conversion is shown in your notebook
If unsure which step is failing, check which of the Python statements below is the first to raise an error:
Evaluate the R code (the added TRUE is to prevent the evaluation from returning x).
robjects.r('x <- SD2011[3, 27]; TRUE')
Fetch the object x obtained from the evaluation above and bind it to a Python symbol (the conversion will be applied).
x = robjects.r('x')
Show a text representation of the converted object
repr(x)
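The same check can be automated. Below is a minimal sketch, assuming each step is wrapped in a callable; `first_failing_step` is a hypothetical helper, and the stand-in functions only simulate the behaviour described above so the example runs without R installed. With rpy2 available, the callables would wrap the three statements listed above.

```python
def first_failing_step(steps):
    """Run (label, callable) pairs in order.

    Returns (label, exception) for the first step that raises,
    or None if every step succeeds.
    """
    for label, step in steps:
        try:
            step()
        except Exception as exc:
            return label, exc
    return None


# Stand-ins that simulate the three steps (assumptions for illustration).
def evaluate():
    return True  # R evaluation succeeds


def convert():
    # Simulates the conversion error reported in the question.
    raise ValueError("codes need to be between -1 and len(categories)-1")


def show():
    return "text representation"


result = first_failing_step(
    [("evaluate", evaluate), ("convert", convert), ("show", show)]
)
print(result[0])  # label of the first failing step
```

Here the conversion step is the one reported, which matches the diagnosis in this answer.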