可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I installed oozie 4.1.0 on a Linux machine by following the steps at http://gauravkohli.com/2014/08/26/apache-oozie-installation-on-hadoop-2-4-1/

hadoop version - 2.6.0 
maven - 3.0.4 
pig - 0.12.0

Cluster Setup -

MASTER NODE runnig - Namenode, Resourcemanager ,proxyserver.

SLAVE NODE running -Datanode,Nodemanager.

When I run single workflow job means it succeeds. But when I try to run more than one Workflow job i.e. both the jobs are in accepted state

Inspecting the error log, I drill down the problem as,

014-12-24 21:00:36,758 [JobControl] INFO  org.apache.hadoop.ipc.Client  - Retrying connect to server: 172.16.***.***/172.16.***.***:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-12-25 09:30:39,145 [communication thread] INFO  org.apache.hadoop.ipc.Client  - Retrying connect to server: 172.16.***.***/172.16.***.***:52406. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-12-25 09:30:39,199 [communication thread] INFO  org.apache.hadoop.mapred.Task  - Communication exception: java.io.IOException: Failed on local exception: java.net.SocketException: Network is unreachable: no further information; Host Details : local host is: "SystemName/127.0.0.1"; destination host is: "172.16.***.***":52406; 
 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
 at org.apache.hadoop.ipc.Client.call(Client.java:1415)
 at org.apache.hadoop.ipc.Client.call(Client.java:1364)
 at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:231)
 at $Proxy9.ping(Unknown Source)
 at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:742)
 at java.lang.Thread.run(Thread.java:722)
Caused by: java.net.SocketException: Network is unreachable: no further information
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
 at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
 at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
 at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
 at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
 at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
 at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
 at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
 at org.apache.hadoop.ipc.Client.call(Client.java:1382)
 ... 5 more

Heart beat
Heart beat
.
.

In the above running jobs, if I kill any one launcher job manually (hadoop job -kill <launcher-job-id>) mean all jobs get succeeded. So I think the problem is more than one launcher job running simultaneously mean job will meet deadlock..

If anyone know the reason and solution for above problem. Please do me the favor as soon as possible.

回答1:

The problem is with the Queue, When we running the Job in SAME QUEUE(DEFAULT) with above cluster setup the Resourcemanager is responsible to run mapreduce job in the salve node. Due to lack of resource in slave node the job running in the queue will meet Deadlock situation.

In order to over come this issue we need to split the Mapreduce job by means of Triggering the mapreduce job in different queue.

you can do this by setting this part in the pig action inside your oozie workflow.xml

<configuration>
<property>
  <name>mapreduce.job.queuename</name>
  <value>launcher2</value>
</property>

NOTE: This solution only for SMALL CLUSTER SETUP

回答2:

I tried below solution it works perfectly for me.

1) Change the Hadoop schedule type from capacity scheduler to fair scheduler. Because for small cluster each queue assign some memory size (2048MB) to complete single map reduce job. If more than one map reduce job run in single queue mean it met deadlock.

Solution: add below property to yarn-site.xml

  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
  <property>
    <name>yarn.scheduler.fair.allocation.file</name>
    <value>file:/%HADOOP_HOME%/etc/hadoop/fair-scheduler.xml</value>
  </property>

2) By default Hadoop Total memory size was allot as 8GB.

So if we run two mapreduce program memory used by Hadoop get more than 8GB so it met deadlock.

Solution: Increase the size of Total Memory of nodemanager using following properties at yarn-site.xml

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>20960</value>
  </property>
  <property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
  </property>
  <property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>2048</value>
  </property>

So If user try to run more than two mapreduce program mean he need to increase nodemanager or he need to increase the size of total memory of Hadoop (note: Increasing the size will reduce the system usage memory. Above property file able to run 10 map reduce program concurrently.)