I've searched for two days for a solution, but nothing has worked.
First off, I'm new to the whole Hadoop/YARN/HDFS topic and want to configure a small cluster.
The message below doesn't show up every time I run an example from the mapreduce-examples.jar. Sometimes teragen works, sometimes not; in some cases the whole job fails, in others it finishes successfully. Sometimes the job fails without printing the message at all.
14/06/08 15:42:46 INFO ipc.Client: Retrying connect to server: FQDN-HOSTNAME/XXX.XX.XX.XXX:53022. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
This message is printed 30 times, and the port (53022 in the example above) changes every time a job is started. If the job finishes successfully, this is printed:
14/06/08 15:34:20 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
14/06/08 15:34:20 INFO mapreduce.Job: Job job_1402234146062_0002 running in uber mode : false
14/06/08 15:34:20 INFO mapreduce.Job: map 100% reduce 100%
14/06/08 15:34:20 INFO mapreduce.Job: Job job_1402234146062_0002 completed successfully
If it fails, this is shown:
INFO mapreduce.Job: Job job_1402234146062_0005 failed with state FAILED due to: Task failed task_1402234146062_0005_m_000002
Job failed as tasks failed. failedMaps:1 failedReduces:0
In this case, some tasks failed, but the log files of the NodeManager, DataNode, ResourceManager, etc. give no reason or error message.
INFO mapreduce.Job: Task Id : attempt_1402234146062_0006_m_000002_1, Status : FAILED
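One way to dig for the task-side error (assuming log aggregation is enabled on the cluster) is to pull the aggregated container logs for the failed application:
yarn logs -applicationId application_1402234146062_0005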
Additional information about my configuration:
OS: CentOS 6.5
Java version: OpenJDK Runtime Environment (rhel-2.4.7.1.el6_5-x86_64 u55-b13), OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
yarn-site.xml
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.address</name>
    <value>FQDN-HOSTNAME:8050</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.localizer.address</name>
    <value>FQDN-HOSTNAME:8040</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>FQDN-HOSTNAME:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>FQDN-HOSTNAME:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>FQDN-HOSTNAME:8032</value>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///var/data/hadoop/hdfs/nn</value>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>file:///var/data/hadoop/hdfs/snn</value>
  </property>
  <property>
    <name>fs.checkpoint.edits.dir</name>
    <value>file:///var/data/hadoop/hdfs/snn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///var/data/hadoop/hdfs/dn</value>
  </property>
</configuration>
mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.cluster.temp.dir</name>
    <value>/mapred/tempDir</value>
  </property>
  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>/mapred/localDir</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>FQDN-HOSTNAME:10020</value>
  </property>
</configuration>
I hope somebody can help me. :) Thank you, Norman
Definitely a bug; this post provides clearer insight into what is happening: https://groups.google.com/a/cloudera.org/forum/#!msg/cdh-user/P1rfMQmYVWk/eARZXHUTkW0J
We are planning on getting around this issue by reducing the ephemeral port range, thus limiting what ports are grabbed, and then configuring iptables to allow for that port range. Setting the port ranges is explained here - http://www.ncftp.com/ncftpd/doc/misc/ephemeral_ports.html
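As a sketch of that workaround on CentOS 6 (the 50000-50100 range is an arbitrary example, not a recommendation):
# Shrink the ephemeral port range the kernel hands out
sysctl -w net.ipv4.ip_local_port_range="50000 50100"
# Persist the setting across reboots
echo "net.ipv4.ip_local_port_range = 50000 50100" >> /etc/sysctl.conf
# Open that same range in iptables so the AM's client port stays reachable
iptables -I INPUT -p tcp --dport 50000:50100 -j ACCEPT
service iptables save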
Wow! Are these answers for real?? Talking about FQDNs when the job clearly completes... as long as the firewall is disabled?? And the OP even posted detailed log messages and configuration.
C'mon guys - RTFQ. The problem is that yarn.app.mapreduce.am.job.client.port-range is not being honored. I'm running into it also.
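For reference, that property lives in mapred-site.xml; a minimal sketch with an arbitrary example range (the bug is that the AM doesn't respect it):
<property>
  <name>yarn.app.mapreduce.am.job.client.port-range</name>
  <value>50100-50200</value>
</property>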
Firewall off... all is well (and I can see the ephemeral ports from the yarn job).
Firewall on... everything times out (eventually).
Horton completely ignores this question on other boards.
So here's a log output from a job which demonstrates the problem. In the first case, I have the firewall enabled on the client(s) based on Horton's doc (along with other ports I discovered by looking very closely at my installation). You will see the process timing out... and then all of a sudden working. Because I disabled the firewall after watching the job output :)
Did ya see it?? Problem with timeout...then all of a sudden Shuffle commences. Nothing to do with FQDNs after all :)
But I'll find the solution and post back.
The job sometimes finishes successfully because, when you have one reducer, that reduce task may by chance be sent to a working NodeManager, and then the job succeeds. You have to make sure that FQDN-HOSTNAME is written exactly the same way in the slaves file. If I remember correctly, my solution was to remove the entry for the hostname mapping in /etc/hosts, that is, to comment it out.
Another possible solution is to check the firewall on all of the nodes. If you're dealing with iptables, you can stop it on every node, as sketched below.
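Assuming CentOS 6 with the iptables service (as on the OP's machines):
service iptables stop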
That will stop the firewall until next restart, but it should be enough for you to test the cluster. You don't have to restart yarn or anything, just run the job again.
If you want to completely stop the FW:
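Again assuming the CentOS 6 iptables service, a sketch:
chkconfig iptables off
service iptables stop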
This is a bug in how the MR AppMaster starts up with ephemeral ports. It exists in the Hadoop 2.6.0 release as well.
I have figured out a fix to this bug and created a JIRA on the MAPREDUCE project along with a comment on how to fix it.
https://issues.apache.org/jira/browse/MAPREDUCE-6338