HDFS error: could only be replicated to 0 nodes, i

Posted 2020-01-26 03:41

I've created a single-node Hadoop cluster on Ubuntu in EC2.

Testing a simple file upload to hdfs works from the EC2 machine, but doesn't work from a machine outside of EC2.

I can browse the filesystem through the web interface from the remote machine, and it shows one datanode, which is reported as in service. I have opened all TCP ports from 0 to 60000(!) in the security group, so I don't think it's that.

I get the error

java.io.IOException: File /user/ubuntu/pies could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1448)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:690)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:342)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1350)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1346)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1344)

at org.apache.hadoop.ipc.Client.call(Client.java:905)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:198)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:928)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:811)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)

The namenode log just gives the same error. The other logs don't seem to have anything interesting.

Any ideas?

Cheers

16 answers
Melony?
#2 · 2020-01-26 04:01

Don't format the namenode right away. First run stop-all.sh, then restart with start-all.sh. Only if the problem persists should you format the namenode.

#3 · 2020-01-26 04:02

I realize I'm a little late to the party, but I wanted to post this for future visitors of this page. I was having a very similar problem when I was copying files from local to hdfs and reformatting the namenode did not fix the problem for me. It turned out that my namenode logs had the following error message:

2012-07-11 03:55:43,479 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-920118459-192.168.3.229-50010-1341506209533, infoPort=50075, ipcPort=50020):DataXceiver java.io.IOException: Too many open files
        at java.io.UnixFileSystem.createFileExclusively(Native Method)
        at java.io.File.createNewFile(File.java:883)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.createTmpFile(FSDataset.java:491)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.createTmpFile(FSDataset.java:462)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.createTmpFile(FSDataset.java:1628)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:1514)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:113)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:381)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:171)

Apparently, this is a relatively common problem on hadoop clusters and Cloudera suggests increasing the nofile and epoll limits (if on kernel 2.6.27) to work around it. The tricky thing is that setting nofile and epoll limits is highly system dependent. My Ubuntu 10.04 server required a slightly different configuration for this to work properly, so you may need to alter your approach accordingly.
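As a quick way to see which open-file limit a process actually gets, here is a minimal Python sketch that reads the nofile limit for the current process. It assumes a Unix-like system and uses only the standard `resource` module. The threshold of 4096 is an arbitrary illustrative value, not a figure from Cloudera's guidance, and the script reports the limit for the user who runs it, which may differ from the limit applied to the Hadoop daemons.

```python
import resource

# Hypothetical threshold for illustration only; pick a value that suits your cluster.
SUGGESTED_MIN_NOFILE = 4096


def nofile_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe_limit(value):
    """Render a limit, showing 'unlimited' for RLIM_INFINITY."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def soft_limit_is_low(soft, minimum=SUGGESTED_MIN_NOFILE):
    """Return True if the soft limit is finite and below the chosen minimum."""
    return soft != resource.RLIM_INFINITY and soft < minimum


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"nofile soft={describe_limit(soft)} hard={describe_limit(hard)}")
    if soft_limit_is_low(soft):
        print("The soft limit looks low for a datanode; consider raising it.")
```

Running this as the same user that starts the datanode gives the most relevant reading, since limits are applied per user and per session.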

唯我独甜
#4 · 2020-01-26 04:02

Follow the steps below:
1. Stop DFS and YARN.
2. Remove the datanode and namenode directories specified in core-site.xml.
3. Start DFS and YARN as follows:

start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver
骚年 ilove
#5 · 2020-01-26 04:09

WARNING: The following will destroy ALL data on HDFS. Do not execute the steps in this answer unless you do not care about destroying existing data!!

You should do this:

  1. Stop all Hadoop services.
  2. Delete the dfs/name and dfs/data directories.
  3. Run hdfs namenode -format and answer the prompt with a capital Y.
  4. Start the Hadoop services.

Also, check the free disk space on your system and make sure the logs are not warning you about it.
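The disk space check can be sketched in a few lines of Python using the standard `shutil.disk_usage` call. This is only an illustration: the path "/" is a placeholder, and you would point it at whichever directory holds your dfs/data files.

```python
import shutil


def free_fraction(path="/"):
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


if __name__ == "__main__":
    # "/" is a placeholder; use the directory that holds dfs/data instead.
    frac = free_fraction("/")
    print(f"{frac:.1%} of the disk is free")
```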

唯我独甜
#6 · 2020-01-26 04:11

I'll try to describe my setup and solution. My setup: RHEL 7, hadoop-2.7.3.

I tried to set up Standalone Operation first and then Pseudo-Distributed Operation; the latter failed with the same issue.

However, when I started Hadoop with:

sbin/start-dfs.sh

I got the following:

Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/<user>/hadoop-2.7.3/logs/hadoop-<user>-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /home/<user>/hadoop-2.7.3/logs/hadoop-<user>-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/<user>/hadoop-2.7.3/logs/hadoop-<user>-secondarynamenode-localhost.localdomain.out

which looked promising ("starting datanode" with no failures), but the datanode was not actually running.

Another indication was that no datanode was shown as in operation (the screenshot originally attached here showed the fixed, working state).

I fixed that issue by running:

rm -rf /tmp/hadoop-<user>/dfs/name
rm -rf /tmp/hadoop-<user>/dfs/data

and then starting again:

sbin/start-dfs.sh
...
在下西门庆
#7 · 2020-01-26 04:11

It took me a week to figure out the problem in my situation.

When the client (your program) asks the namenode for a data operation, the namenode picks a datanode and directs the client to it by giving the client that datanode's IP address.

But when the datanode host is configured with multiple IP addresses and the namenode hands out one that your client CAN'T reach, the client adds that datanode to its exclude list and asks the namenode for another one. Eventually all datanodes are excluded, and you get this error.

So check the node's IP settings before you try anything else!
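One way to test this from the client machine is to try opening a TCP connection to the datanode address that the namenode reports. The sketch below is a minimal Python example using only the standard `socket` module. The host and port are placeholders: 50010 is the datanode port that appears in the log quoted earlier on this page, and your cluster may use a different one.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


if __name__ == "__main__":
    # Placeholder address: replace with the datanode address the namenode reports.
    datanode_host, datanode_port = "127.0.0.1", 50010
    ok = can_connect(datanode_host, datanode_port)
    print(f"{datanode_host}:{datanode_port} reachable: {ok}")
```

If this returns False from the remote client but True from the cluster host itself, the client is probably being handed an address it cannot route to, which matches the situation described above.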
