Reading remote HDFS file with Java

2019-02-22 04:58发布

问题:

I’m having a bit of trouble with a simple Hadoop install. I’ve downloaded hadoop 2.4.0 and installed on a single CentOS Linux node (Virtual Machine). I’ve configured hadoop for a single node with pseudo distribution as described on the apache site (http://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-common/SingleCluster.html). It starts with no issues in the logs and I can read + write files using the “hadoop fs” commands from the command line.

I’m attempting to read a file from the HDFS on a remote machine with the Java API. The machine can connect and list directory contents. It can also determine if a file exists with the code:

Path p=new Path("hdfs://test.server:9000/usr/test/test_file.txt");
FileSystem fs = FileSystem.get(new Configuration());
System.out.println(p.getName() + " exists: " + fs.exists(p));

The system prints “true” indicating it exists. However, when I attempt to read the file with:

BufferedReader br = null;
try {
    Path p=new Path("hdfs://test.server:9000/usr/test/test_file.txt");
    FileSystem fs = FileSystem.get(CONFIG);
    System.out.println(p.getName() + " exists: " + fs.exists(p));

    br=new BufferedReader(new InputStreamReader(fs.open(p)));
    String line = br.readLine();

    while (line != null) {
        System.out.println(line);
        line=br.readLine();
    }
}
finally {
    if(br != null) br.close();
}

this code throws the exception:

Exception in thread "main" org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-13917963-127.0.0.1-1398476189167:blk_1073741831_1007 file=/usr/test/test_file.txt

Googling gave some possible tips but all checked out. The data node is connected, active, and has enough space. The admin report from hdfs dfsadmin –report shows:

Configured Capacity: 52844687360 (49.22 GB)
Present Capacity: 48507940864 (45.18 GB)
DFS Remaining: 48507887616 (45.18 GB)
DFS Used: 53248 (52 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

Datanodes available: 1 (1 total, 0 dead)

Live datanodes:
Name: 127.0.0.1:50010 (test.server)
Hostname: test.server
Decommission Status : Normal
Configured Capacity: 52844687360 (49.22 GB)
DFS Used: 53248 (52 KB)
Non DFS Used: 4336746496 (4.04 GB)
DFS Remaining: 48507887616 (45.18 GB)
DFS Used%: 0.00%
DFS Remaining%: 91.79%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Last contact: Fri Apr 25 22:16:56 PDT 2014

The client jars were copied directly from the hadoop install so no version mismatch there. I can browse the file system with my Java class and read file attributes. I just can’t read the file contents without getting the exception. If I try to write a file with the code:

FileSystem fs = null;
BufferedWriter br = null;

System.setProperty("HADOOP_USER_NAME", "root");

try {
    fs = FileSystem.get(new Configuraion());

    //Path p = new Path(dir, file);
    Path p = new Path("hdfs://test.server:9000/usr/test/test.txt");
    br = new BufferedWriter(new OutputStreamWriter(fs.create(p,true)));
    br.write("Hello World");
}
finally {
    if(br != null) br.close();
    if(fs != null) fs.close();
}

this creates the file but doesn’t write any bytes and throws the exception:

Exception in thread "main" org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /usr/test/test.txt could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.

Googling for this indicated a possible space issue but from the dfsadmin report, it seems there is plenty of space. This is a plain vanilla install and I can’t get past this issue.

The environment summary is:

SERVER:

Hadoop 2.4.0 with pseudo-distribution (http://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-common/SingleCluster.html)

CentOS 6.5 Virtual Machine 64 bit server Java 1.7.0_55

CLIENT:

Windows 8 (Virtual Machine) Java 1.7.0_51

Any help is greatly appreciated.

回答1:

Hadoop error messages are frustrating. Often they don't say what they mean and have nothing to do with the real issue. I've seen problems like this occur when the client, namenode, and datanode cannot communicate properly. In your case I would pick one of two issues:

  • Your cluster runs in a VM and its virtualized network access to the client is blocked.
  • You are not consistently using fully-qualified domain names (FQDN) that resolve identically between the client and host.

The host name "test.server" is very suspicious. Check all of the following:

  • Is test.server a FQDN?
  • Is this the name that has been used EVERYWHERE in your conf files?
  • Can the client and all hosts forward and reverse resolve "test.server" and its IP address and get the same thing?
  • Are IP addresses being used instead of FQDN anywhere?
  • Is "localhost" being used anywhere?

Any inconsistency in the use of FQDN, hostname, numeric IP, and localhost must be removed. Do not ever mix them in your conf files or in your client code. Consistent use of FQDN is preferred. Consistent use of numeric IP usually also works. Use of unqualified hostname, localhost, or 127.0.0.1 cause problems.



回答2:

The answer above is pointing to the right direction. Allow me to add the following:

  1. Namenode does NOT directly read or write data.
  2. Client (your Java program using Direct access to HDFS) interacts with Namenode to update HDFS namespace and retrieve block locations for reading/writing.
  3. Client interacts directly with Datanode to read/write data.

You were able to list directory contents because hostname:9000was accessible to your client code. You were doing the number 2 above.
To be able to read and write, your client code needs access to the Datanode (number 3). The default port for Datanode DFS data transfer is 50010. Something was blocking your client communication to hostname:50010. Possibly a firewall or SSH tunneling configuration problem.
I was using Hadoop 2.7.2, so maybe you have a different port number setting.



回答3:

We need to make sure to have configuration with fs.default.name space set such as

configuration.set("fs.default.name","hdfs://ourHDFSNameNode:50000");

Below I've put a piece of sample code:

 Configuration configuration = new Configuration();
 configuration.set("fs.default.name","hdfs://ourHDFSNameNode:50000");
 FileSystem fs = pt.getFileSystem(configuration);
 BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)));
 String line = null;
 line = br.readLine
 while (line != null) {
  try {
    line = br.readLine
    System.out.println(line);
  }
}