Py4J has a bigger overhead than Jython and JPype

Posted 2019-06-23 23:16

Question:

After searching for a way to run Java code from a Django (Python) application, I found that Py4J is the best option for me. I tried Jython, JPype, and a Python subprocess, and each of them has certain limitations:

  • Jython: my application runs in CPython, not on Jython.
  • JPype is buggy: the JVM can be started only once; after it is shut down, it fails to start again.
  • Python subprocess: Java objects cannot be passed between Python and Java, because communication happens through a plain console call.

The Py4J website states:

In terms of performance, Py4J has a bigger overhead than both of the previous solutions (Jython and JPype) because it relies on sockets, but if performance is critical to your application, accessing Java objects from Python programs might not be the best idea.

In my application performance is critical, because I'm working with the machine learning framework Mahout. My question is: will Mahout itself also run slower because of the Py4J gateway server, or does the overhead only mean that invoking Java methods from Python is slower? (In the latter case Mahout's performance would not be a problem and I could use Py4J.)

Answer 1:

I don't know Mahout, but consider this: with both JPype and Py4J you pay a performance cost whenever types are converted between Java and Python and vice versa. Try to minimize the number of calls between the two languages. One alternative is to write a thin wrapper in Java that condenses many Java calls into a single Python-to-Java call.
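A minimal sketch of that idea with Py4J, assuming a GatewayServer is already running on the Java side and that its entry point exposes a hypothetical coarse-grained method named trainAndEvaluate (the method name and file path are made up for illustration):

from py4j.java_gateway import JavaGateway

# Connect to a running Py4J GatewayServer on the Java side.
gateway = JavaGateway()
app = gateway.entry_point  # Java object registered as the gateway entry point

# Chatty version: every call crosses the socket and converts
# arguments/return values each time.
# model = app.createModel()
# for row in training_rows:
#     app.addSample(model, row)   # one round trip per sample
# score = app.evaluate(model)

# Condensed version: one coarse-grained Java method runs the whole loop
# inside the JVM, so only a single round trip is paid.
score = app.trainAndEvaluate("/data/training.csv")
print(score)

The fewer times data crosses the socket, the less the Py4J overhead matters.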



Answer 2:

Because performance also depends on your usage scenario (how often you call the script and how much data is moved), and because the different solutions have their own specific benefits and drawbacks, I have created an API that lets you switch between the different implementations without having to change your Python script: https://github.com/subes/invesdwin-context-python

This makes it easy to test which solution works best, or simply to stay flexible about what you deploy to.



Answer 3:

PySpark uses Py4J quite successfully. If all the heavy lifting is done in Spark (or Mahout, in your case) itself, and you just want to return the result to the "driver"/Python code, then Py4J may work very well for you too.

Py4J has a somewhat bigger overhead for huge results (which is not necessarily an issue for Spark workloads, since you typically only return summaries/aggregates of the dataframes). There is also an improvement discussion for Py4J about switching to binary serialization to remove that overhead for higher-bandwidth requirements: https://github.com/bartdag/py4j/issues/159
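To illustrate the point about small results, here is a minimal PySpark sketch (the file path and column names are made up): all of the scanning and grouping happens inside the JVM, and only the tiny aggregated result crosses the Py4J socket back to the Python driver.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("py4j-overhead-demo").getOrCreate()

# The DataFrame lives in the JVM; this line only sends commands over Py4J.
df = spark.read.csv("/data/events.csv", header=True, inferSchema=True)

# The heavy lifting (scan, group, count) runs on the JVM/executors.
summary = df.groupBy("category").count()

# Only the small aggregated result is serialized back to Python.
for row in summary.collect():
    print(row["category"], row["count"])

spark.stop()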



Answer 4:

The JPype issue that @HIP_HOP mentioned, where the JVM gets detached from new threads, can be worked around with the following hack (add it before the first call to Java objects in any new thread that is not yet attached to the JVM):

import jpype

# Ensure that the current thread is attached to the JVM
# (essential to prevent JVM / entire container crashes
# due to "JPJavaEnv::FindClass" errors).
if not jpype.isThreadAttachedToJVM():
    jpype.attachThreadToJVM()
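For example, in a worker thread the check could sit at the top of the thread's target function (a sketch; the JVM is assumed to have been started already in the main thread, and process_with_java is a placeholder for your own Java-calling code):

import threading
import jpype

def worker():
    # Attach this worker thread to the JVM before touching any Java objects.
    if not jpype.isThreadAttachedToJVM():
        jpype.attachThreadToJVM()
    process_with_java()  # placeholder for your actual Java-calling code

threading.Thread(target=worker).start()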