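I am running ipcontroller on the head node of an SGE cluster, with ~50 engines running on other nodes (submitted using qsub). The engines connect to the controller and register without any problems. I can also SSH into the head node, view the engine IDs, and run simple code. For example, this works fine: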
%px %pylab inline
parallel_result = lbView.map_sync(lambda x: x*rand(), range(32))
However, when I try to run the following line, the engines crash:
%px from sklearn.svm import LinearSVC
and I get the following error:
importing LinearSVC from sklearn.svm on engine(s)
[Engine Exception]
Traceback (most recent call last):
File "/usr/global/anaconda/lib/python2.7/site-packages/ipyparallel/client/client.py", line 713, in _handle_stranded_msgs
raise error.EngineError("Engine %r died while running task %r"%(eid, msg_id))
EngineError: Engine 0 died while running task '48c99848-0784-4ea1-a8c9-900685e955a3'
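The same command works fine when I run it in a plain IPython session on the cluster head node, and also when I use ipyparallel on another server (no SGE) with 12 engines running locally.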
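Since the import only fails on the SGE-launched engines, one thing that may be worth comparing is the environment each engine sees against the head node. The following is only a diagnostic sketch, not a confirmed fix: the function name `env_report` and the choice of fields are assumptions made for illustration.

```python
import sys


def env_report():
    """Collect basic facts about the interpreter that runs this function."""
    # Imports are repeated inside the function so that it still works when it
    # is shipped to a remote engine, where module-level names may be missing.
    import os
    import platform
    import sys

    return {
        "host": platform.node(),
        "executable": sys.executable,
        "version": sys.version.split()[0],
        # Assumption: a differing library path is one plausible difference
        # between qsub-launched engines and an interactive shell.
        "ld_library_path": os.environ.get("LD_LIBRARY_PATH", ""),
    }


if __name__ == "__main__":
    # Run locally on the head node to get a baseline for comparison.
    print(env_report())
```

With a connected client, for example `rc = ipyparallel.Client(profile='KK_Fiji_SGE')` (the profile name is inferred from the log paths and may differ), `rc[:].apply_sync(env_report)` should return one such dictionary per engine, which can then be compared with the head-node output.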
I have set the logging level to DEBUG; here is the output from the engine and the controller:
Snippet of ipengine output:
2016-05-28 18:18:48.403 [IPEngineApp] apply_request: {'parent_header': {}, 'msg_type': u'apply_request', 'msg_id': u'4ca3bef9-5cbf-4b56-a232-b3f289dcf6a6', 'content': {}, 'header': {u'username': u'ABC', u'version': u'5.0', u'msg_type': u'apply_request', u'msg_id': u'4ca3bef9-5cbf-4b56-a232-b3f289dcf6a6', u'session': u'83df95f4-e961-4e8f-aa3c-2540719e08f4', u'date': datetime.datetime(2016, 5, 28, 18, 18, 48, 392750)}, 'buffers': [<memory at 0x2aaab7c348a0>, <memory at 0x2aaab7c34d60>, <memory at 0x2aaab7c34df8>, <memory at 0x2aaab7c34e90>, <memory at 0x2aaab7ba3218>], 'metadata': {}}
Snippet of ipcontroller output:
2016-05-28 18:19:26.043 [IPControllerApp] registration::unregister_engine(8)
2016-05-28 18:19:26.043 [IPControllerApp] save engine state to /data1/home/kamesh/.ipython/profile_KK_Fiji_SGE/log/engines.json
2016-05-28 18:19:26.045 [IPControllerApp] heartbeat::handle_heart_failure('37d9bc53-66f8-4d14-9501-02c56a0ff1f0')
2016-05-28 18:19:26.045 [IPControllerApp] registration::unregister_engine(2)
2016-05-28 18:19:26.046 [IPControllerApp] save engine state to /data1/home/ABC/.ipython/profile_KK_Fiji_SGE/log/engines.json
2016-05-29 07:31:35.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 1
2016-05-29 07:31:38.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 2
2016-05-29 07:31:41.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 3
2016-05-29 07:31:44.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 4
2016-05-29 07:31:47.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 5
2016-05-29 07:31:50.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 6
2016-05-29 07:31:53.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 7
2016-05-29 07:31:56.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 8
2016-05-29 07:31:59.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 9
2016-05-29 07:32:02.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 10
2016-05-29 07:32:05.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 11
2016-05-29 07:32:05.031 [IPControllerApp] heartbeat::handle_heart_failure('ec0b5d83-b354-43c6-b7ec-909f6fd403fc')
2016-05-29 07:32:05.031 [IPControllerApp] registration::unregister_engine(4)
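To make a longer controller log easier to scan, the missed-heartbeat lines can be tallied per heart ID. This is a minimal sketch that assumes the log format shown above; the function name `max_missed_heartbeats` and the regular expression are hypothetical helpers, not part of ipyparallel.

```python
import re

# Matches controller log lines such as:
# 2016-05-29 07:31:35.030 [IPControllerApp] heartbeat::missed <heart-id> : 1
MISSED = re.compile(
    r"heartbeat::missed\s+(?P<heart>[0-9a-f-]+)\s*:\s*(?P<count>\d+)"
)


def max_missed_heartbeats(lines):
    """Return {heart_id: highest missed-heartbeat count seen} for log lines."""
    worst = {}
    for line in lines:
        match = MISSED.search(line)
        if match:
            heart = match.group("heart")
            count = int(match.group("count"))
            worst[heart] = max(worst.get(heart, 0), count)
    return worst


if __name__ == "__main__":
    sample = [
        "2016-05-29 07:31:35.030 [IPControllerApp] heartbeat::missed "
        "ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 1",
        "2016-05-29 07:31:38.030 [IPControllerApp] heartbeat::missed "
        "ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 2",
    ]
    print(max_missed_heartbeats(sample))
```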