Hadoop safemode recovery - taking a lot of time


Question:

We are running our cluster on Amazon EC2 and using the Cloudera scripts to set up Hadoop. On the master node, we start the services below.

    $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start namenode'
    $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start secondarynamenode'
    $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start jobtracker'

    $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop dfsadmin -safemode wait'
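
For what it is worth, here is a minimal sketch of a bounded wait in place of the indefinite dfsadmin -safemode wait above, so a stalled startup fails loudly instead of blocking forever. It assumes $HADOOP_HOME is set as in the Cloudera scripts; the 3600-second limit and 30-second poll interval are arbitrary example values:

    # Poll the NameNode's safe mode status; give up after $timeout seconds
    # instead of waiting indefinitely.
    timeout=3600
    elapsed=0
    while "$HADOOP_HOME"/bin/hadoop dfsadmin -safemode get | grep -q 'ON'; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "NameNode still in safe mode after ${timeout}s" >&2
            exit 1
        fi
        sleep 30
        elapsed=$((elapsed + 30))
    done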

On the slave machines, we run the services below.

    $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start datanode'
    $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start tasktracker'

The main problem we are facing is that HDFS safe mode recovery takes more than an hour, and this delays our job completion.

Below are the main log messages.

1. domU-12-31-39-0A-34-61.compute-1.internal 10/05/05 20:44:19 INFO ipc.Client: Retrying connect to server: ec2-184-73-64-64.compute-1.amazonaws.com/10.192.11.240:8020. Already tried 21 time(s).
2. The reported blocks 283634 needs additional 322258 blocks to reach the threshold 0.9990 of total blocks 606499. Safe mode will be turned off automatically.

The first message appears in the TaskTracker logs because the JobTracker has not started; the JobTracker did not start because HDFS is still in safe mode recovery.

The second message is logged during the recovery process.
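
For reference, the numbers in the second message are consistent with each other: the threshold works out to 0.9990 * 606499, which is about 605892 blocks that must be reported before the NameNode leaves safe mode, and 605892 - 283634 = 322258, so roughly half of the blocks had not yet been reported by the DataNodes at that point.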

Is there something I am doing wrong? How long does a normal HDFS safe mode recovery take? Would there be any speedup from not starting the TaskTrackers until the JobTracker is up? Are there any known Hadoop problems on Amazon clusters?

Thanks for your help.

Answer 1:

The time spent in safe mode is usually proportional to the size of the cluster. That said, normal time is on the order of minutes at most, not hours. There are a few things to check.

  1. Confirm that all DataNodes are starting up correctly. It is normal for DataNodes with a large number of blocks to take a few seconds or even minutes to report in; check the DataNode logs to see what is happening during startup. (A quick check is sketched after this list.)
  2. Ensure you have enough NameNode handler threads (dfs.namenode.handler.count in hdfs-site.xml) to handle the number of DataNodes checking in. The default is 10, which should be fine for clusters of up to about 20 nodes; beyond that it may make sense to increase it. If the value is too low you may see retries in the DataNode logs, which is what the retry messages (e.g. "Already tried 21 time(s)") seem to indicate to me. An example of raising the value is shown after this list.
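
For point 1, a quick way to see how many DataNodes the NameNode currently knows about is the dfsadmin report. A minimal sketch, run on the master and assuming the same $HADOOP_HOME layout as in the startup scripts above:

    # Print the cluster summary, including how many DataNodes are live
    # and the overall capacity/usage figures.
    "$HADOOP_HOME"/bin/hadoop dfsadmin -report | head -20

For point 2, raising the handler count is a small change in hdfs-site.xml followed by a NameNode restart. The value 64 below is only an illustrative starting point for a larger cluster, not a tested recommendation:

    <!-- hdfs-site.xml: number of RPC handler threads on the NameNode.
         The default is 10; 64 is only an illustrative value here. -->
    <property>
      <name>dfs.namenode.handler.count</name>
      <value>64</value>
    </property>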

Hope this helps.