I have some Python code that passes a pandas data frame to R via rpy2, whereupon R processes it and I pull the resulting data.frame back to Python as a pandas data frame via com.load_data.
The thing is, the call to com.load_data works fine in a single Python process, but it crashes when the same code is run concurrently in several multiprocessing.Process processes. I get the following error message out of Python:
  File "C:\Python27\lib\site-packages\pandas\rpy\common.py", line 29, in load_data
    r.data(name)
TypeError: 'DataFrame' object is not callable
So my question is: is rpy2 not actually designed to be run in parallel, or is this merely a bug in the load_data function? I had assumed that each Python process would get its own independent R session. As far as I can tell, the only workaround would be to have R write its output to a text file, which the appropriate Python process can then open and continue processing. But this is pretty clunky.
Update with some code:
from rpy2.robjects.packages import importr
import rpy2.robjects as ro
import pandas as pd
import pandas.rpy.common as com
# Load C50 library into R environment
C50 = importr('C50')
...
# PANDAS data frame containing test dataset
testing = pd.DataFrame(testing)
# Pass testing dataset to R
rtesting = com.convert_to_r_dataframe(testing)
ro.globalenv['test'] = rtesting
# Strip "AsIs" from each column in the R data frame
# so that predict.C5.0 will work
for c in range(len(testing.columns)):
    ro.r('''class(test[,{0}])=class(test[,{0}])[-match("AsIs", class(test[,{0}]))]'''.format(c+1))
# Make predictions on test dataset (res is pre-existing C5.0 tree)
ro.r('''preds=predict.C5.0(res, newdata=test)''')
ro.r('''preds=as.data.frame(preds)''')
# Get the predictions from R
preds = com.load_data('preds') ### Crashes here when code is run on several processes concurrently
#Further processing as necessary
...
rpy2 works by running a Python process and an R process in parallel and exchanging information between them. It does not take into account that R calls may themselves be made in parallel via multiprocessing. So in practice, each of the Python processes connects to the same R process, which probably causes the issues you see.

One way to circumvent this is to implement the parallel processing in R rather than in Python: send everything at once to R, let R process it in parallel, and have the result sent back to Python.
The following (Python 3) code suggests that, at least when a multiprocessing.Pool is used, a separate R process is spawned for each worker process (@lgautier, is this right?).
On my OS X laptop this results in something like