Upload data to HDFS with Java API

Question:

I've searched for some time now and none of the solutions seem to work for me.

Pretty straightforward - I want to upload data from my local file system to HDFS using the Java API. The Java program will be run on a host that has been configured to talk to a remote Hadoop cluster through shell (i.e. hdfs dfs -ls, etc.).

I have included the below dependencies in my project:

hadoop-core:1.2.1
hadoop-common:2.7.1
hadoop-hdfs:2.7.1

I have code that looks like the following:

 File localDir = ...;
 File hdfsDir = ...;
 Path localPath = new Path(localDir.getCanonicalPath());
 Path hdfsPath = new Path(hdfsDir.getCanonicalPath());
 Configuration conf = new Configuration();
 conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
 conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
 FileSystem fs = FileSystem.get(conf);
 fs.copyFromLocalFile(localPath, hdfsPath);

The local data is not being copied to the Hadoop cluster, but no errors are reported and no exceptions are thrown. I've enabled TRACE logging for the org.apache.hadoop package. I see the following outputs:

 DEBUG Groups:139 -  Creating new Groups object
 DEBUG Groups:139 -  Creating new Groups object
 DEBUG Groups:59 - Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
 DEBUG Groups:59 - Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
 DEBUG UserGroupInformation:147 - hadoop login
 DEBUG UserGroupInformation:147 - hadoop login
 DEBUG UserGroupInformation:96 - hadoop login commit
 DEBUG UserGroupInformation:96 - hadoop login commit
 DEBUG UserGroupInformation:126 - using local user:UnixPrincipal: willra05
 DEBUG UserGroupInformation:126 - using local user:UnixPrincipal: willra05
 DEBUG UserGroupInformation:558 - UGI loginUser:<username_redacted>
 DEBUG UserGroupInformation:558 - UGI loginUser:<username_redacted>
 DEBUG FileSystem:1441 - Creating filesystem for file:///
 DEBUG FileSystem:1441 - Creating filesystem for file:///
 DEBUG FileSystem:1290 - Removing filesystem for file:///
 DEBUG FileSystem:1290 - Removing filesystem for file:///
 DEBUG FileSystem:1290 - Removing filesystem for file:///
 DEBUG FileSystem:1290 - Removing filesystem for file:///

Can anyone help me resolve this issue?

EDIT 1: (09/15/2015)

I've removed 2 of the Hadoop dependencies - I'm only using one now:

hadoop-core:1.2.1

My code is now the following:

File localDir = ...;
File hdfsDir = ...;
Path localPath = new Path(localDir.getCanonicalPath());
Path hdfsPath = new Path(hdfsDir.getCanonicalPath());
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
fs.copyFromLocalFile(localPath, hdfsPath);

I was previously executing my application with the following command:

$ java -jar <app_name>.jar <app_arg1> <app_arg2> ...

Now I'm executing it with this command:

$ hadoop jar <app_name>.jar <app_arg1> <app_arg2> ...

With these changes, my application now interacts with HDFS as intended. To my knowledge, the hadoop jar command is meant only for MapReduce jobs packaged as an executable JAR, but these changes did the trick for me.
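For anyone hitting the same symptom: the repeated "Creating filesystem for file:///" lines in the trace suggest the Configuration never picked up the cluster's fs.defaultFS, so the copy most likely went to the local file system rather than HDFS. A quick diagnostic sketch (assuming the same code path as above) is to print what the Configuration actually resolves to:

Configuration conf = new Configuration();
// With plain `java -jar` and no Hadoop conf XMLs on the classpath, this
// typically prints file:/// -- which matches the trace output above.
System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
System.out.println("resolved fs  = " + FileSystem.get(conf).getUri());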

Answer 1:

I am not sure about the approach you are following, but below is one way data can be uploaded to HDFS using the Java libraries:

// imports required
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// some class here .....
Configuration conf = new Configuration();
conf.set("fs.defaultFS", <hdfs write endpoint>);
FileSystem fs = FileSystem.get(conf);
fs.copyFromLocalFile(<src>, <dst>);   // src and dst are org.apache.hadoop.fs.Path

Also, if you have the Hadoop conf XMLs locally, you can include them in your classpath. The HDFS details will then be picked up automatically at runtime, and you will not need to set "fs.defaultFS". If you are running against an old HDFS version, you might need to use "fs.default.name" instead of "fs.defaultFS". If you are not sure of the HDFS endpoint, it is usually the HDFS NameNode URL. Here is an example from a previous similar question: copying directory from local system to hdfs java code
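To make that concrete, here is a minimal sketch of loading the conf XMLs explicitly instead of relying on the classpath; the /etc/hadoop/conf paths are an assumption, so adjust them to wherever your client configs actually live:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
// Assumed location of the cluster's client configs -- change as needed.
conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));

FileSystem fs = FileSystem.get(conf);   // now resolves to the cluster, not file:///
fs.copyFromLocalFile(<src>, <dst>);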



Answer 2:

Two things:

  1. If you are creating a Hadoop client, it is better to add the hadoop-client dependency, which includes the required dependencies of all the sub-modules: https://github.com/apache/hadoop/blob/2087eaf684d9fb14b5390e21bf17e93ac8fea7f8/hadoop-client/pom.xml. The exception is if the size of the JAR is a concern and you are very sure that you won't require any other dependency.
  2. When you execute a job using the hadoop command, the class that gets executed is RunJar, not your driver class; RunJar then runs your job. For more details you can see the code here: https://github.com/apache/hadoop/blob/2087eaf684d9fb14b5390e21bf17e93ac8fea7f8/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/RunJar.java#L139

If you review the createClassLoader method in the RunJar class, you will notice that several locations are being included in the classpath.

So, if you execute your class directly with the java -jar command, you may be skipping all the other steps that hadoop jar performs to run your job in Hadoop.
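If you still want to launch with plain java, one workaround (an assumption on my part, not something the answers above require) is to put the Hadoop jars and conf directory on the classpath yourself via the hadoop classpath command; com.example.Main below is a stand-in for your driver class:

 $ java -cp "<app_name>.jar:$(hadoop classpath)" com.example.Main <app_arg1> <app_arg2> ...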



Answer 3:

Kasa, you need to use the method

public static FileSystem get(URI uri, Configuration conf)

to get fs; the uri parameter is necessary if you use the java -jar command.
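A minimal sketch of that overload, assuming hdfs://namenode-host:8020 stands in for your cluster's actual fs.defaultFS value:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
// namenode-host:8020 is a placeholder -- use your NameNode URI here.
FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);
fs.copyFromLocalFile(new Path(localDir.getCanonicalPath()),
                     new Path(hdfsDir.getCanonicalPath()));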



Tags: java hadoop hdfs