I use an SGE cluster with ipcontroller running on the head node and ~50 engines running on the other nodes (submitted using qsub). The engines connect to and register with the controller without any issues. I can also connect to the head node over SSH, view the engine IDs, and run simple code. For example, this works perfectly well:
%px %pylab inline
parallel_result = lbView.map_sync(lambda x: x*rand(), range(32))
However, when I run the following line, the engines crash:
%px from sklearn.svm import LinearSVC
with the following error:
importing LinearSVC from sklearn.svm on engine(s)
[Engine Exception]
Traceback (most recent call last):
  File "/usr/global/anaconda/lib/python2.7/site-packages/ipyparallel/client/client.py", line 713, in _handle_stranded_msgs
    raise error.EngineError("Engine %r died while running task %r"%(eid, msg_id))
EngineError: Engine 0 died while running task '48c99848-0784-4ea1-a8c9-900685e955a3'
The exact same command works perfectly well when I run it in an IPython instance on the head node of the cluster, or even with ipyparallel on another server (no SGE) with 12 engines running locally.
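Since the import succeeds on the head node but kills the engines, one way to narrow this down is to compare the Python environment seen on each side. The following is only a minimal sketch: `describe_environment` and `differing_keys` are hypothetical helper names, and the choice of fields (interpreter path, Python version, `LD_LIBRARY_PATH`, `OMP_NUM_THREADS`) is an assumption about what might differ between nodes, not a confirmed cause.

```python
import os
import platform
import sys


def describe_environment():
    """Collect details that may differ between the head node and an engine."""
    # The imports are repeated inside the function so that it stays
    # self-contained if it is shipped to a remote engine.
    import os
    import platform
    import sys

    return {
        "host": platform.node(),
        "executable": sys.executable,
        "python": platform.python_version(),
        # These two variables are assumptions about what might matter here.
        "ld_library_path": os.environ.get("LD_LIBRARY_PATH", ""),
        "omp_num_threads": os.environ.get("OMP_NUM_THREADS", ""),
    }


def differing_keys(reference, other, ignore=("host",)):
    """Return the sorted keys whose values differ between two reports."""
    return sorted(
        key for key in reference
        if key not in ignore and reference[key] != other.get(key)
    )


if __name__ == "__main__":
    # Local report; on the cluster this would be the head-node reference.
    print(describe_environment())
```

With a connected `ipyparallel.Client` (called `rc` here as an assumption), `rc[:].apply_sync(describe_environment)` should return one such report per engine, and each can be passed to `differing_keys` together with the head-node report.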
I have set the logging level to debug; here is what the engines and the controller output:
Snippet IPENGINE OUTPUT:
2016-05-28 18:18:48.403 [IPEngineApp] apply_request: {'parent_header': {}, 'msg_type': u'apply_request', 'msg_id': u'4ca3bef9-5cbf-4b56-a232-b3f289dcf6a6', 'content': {}, 'header': {u'username': u'ABC', u'version': u'5.0', u'msg_type': u'apply_request', u'msg_id': u'4ca3bef9-5cbf-4b56-a232-b3f289dcf6a6', u'session': u'83df95f4-e961-4e8f-aa3c-2540719e08f4', u'date': datetime.datetime(2016, 5, 28, 18, 18, 48, 392750)}, 'buffers': [<memory at 0x2aaab7c348a0>, <memory at 0x2aaab7c34d60>, <memory at 0x2aaab7c34df8>, <memory at 0x2aaab7c34e90>, <memory at 0x2aaab7ba3218>], 'metadata': {}}
Snippet IPCONTROLLER OUTPUT:
2016-05-28 18:19:26.043 [IPControllerApp] registration::unregister_engine(8)
2016-05-28 18:19:26.043 [IPControllerApp] save engine state to /data1/home/kamesh/.ipython/profile_KK_Fiji_SGE/log/engines.json
2016-05-28 18:19:26.045 [IPControllerApp] heartbeat::handle_heart_failure('37d9bc53-66f8-4d14-9501-02c56a0ff1f0')
2016-05-28 18:19:26.045 [IPControllerApp] registration::unregister_engine(2)
2016-05-28 18:19:26.046 [IPControllerApp] save engine state to /data1/home/ABC/.ipython/profile_KK_Fiji_SGE/log/engines.json
2016-05-29 07:31:35.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 1
2016-05-29 07:31:38.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 2
2016-05-29 07:31:41.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 3
2016-05-29 07:31:44.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 4
2016-05-29 07:31:47.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 5
2016-05-29 07:31:50.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 6
2016-05-29 07:31:53.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 7
2016-05-29 07:31:56.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 8
2016-05-29 07:31:59.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 9
2016-05-29 07:32:02.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 10
2016-05-29 07:32:05.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 11
2016-05-29 07:32:05.031 [IPControllerApp] heartbeat::handle_heart_failure('ec0b5d83-b354-43c6-b7ec-909f6fd403fc')
2016-05-29 07:32:05.031 [IPControllerApp] registration::unregister_engine(4)
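With ~50 engines the controller log gets long, so it can help to condense the heartbeat lines into one entry per engine. This is a rough sketch that assumes the log format shown in the controller output above; the regular expressions and the function name `summarize_heartbeats` are my own and are not part of ipyparallel.

```python
import re

# Patterns written against the controller log lines shown in this post.
MISSED = re.compile(r"heartbeat::missed (\S+) : (\d+)")
FAILED = re.compile(r"heartbeat::handle_heart_failure\('([^']+)'\)")


def summarize_heartbeats(lines):
    """Return (highest missed count per heart UUID, list of failed UUIDs)."""
    missed = {}
    failed = []
    for line in lines:
        match = MISSED.search(line)
        if match:
            uuid, count = match.group(1), int(match.group(2))
            # Keep the highest miss count reported for this heart.
            missed[uuid] = max(missed.get(uuid, 0), count)
            continue
        match = FAILED.search(line)
        if match:
            failed.append(match.group(1))
    return missed, failed


if __name__ == "__main__":
    sample = [
        "2016-05-29 07:32:02.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 10",
        "2016-05-29 07:32:05.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 11",
        "2016-05-29 07:32:05.031 [IPControllerApp] heartbeat::handle_heart_failure('ec0b5d83-b354-43c6-b7ec-909f6fd403fc')",
    ]
    print(summarize_heartbeats(sample))
```

Passing an open log file to `summarize_heartbeats` works as well, since it only iterates over lines.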