Writing data to Hadoop

Posted 2019-01-21 20:44

Question:

I need to write data into Hadoop (HDFS) from external sources such as a Windows box. Right now I have been copying the data onto the namenode and using HDFS's put command to ingest it into the cluster. In my browsing of the code I didn't see an API for doing this. I am hoping someone can show me that I am wrong and that there is an easy way to code external clients against HDFS.

Answer 1:

Install Cygwin and install Hadoop locally (you just need the binary and the configs that point at your NameNode; there is no need to actually run the services), then run hadoop fs -copyFromLocal /path/to/localfile /hdfs/path/

You can also use the new Cloudera desktop to upload a file via the web UI, though that might not be a good option for giant files.

There's also a WebDAV overlay for HDFS but I don't know how stable/reliable that is.



Answer 2:

There is a Java API. You can use it by including the Hadoop code in your project. The JavaDoc is quite helpful in general, but of course you have to know what you are looking for: http://hadoop.apache.org/common/docs/

For your particular problem, have a look at http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html (this applies to the latest release; consult the JavaDocs for other versions).

A typical call would be FileSystem.get(new JobConf()).create(new Path("however.file")), which returns a stream you can handle with regular Java I/O.
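
Fleshed out, a minimal sketch of that call might look like this (the class name, target path, and the commented-out NameNode address are placeholders for illustration, not details from the original answer; the JobConf-based call from above is kept, though any Hadoop Configuration would do):

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            // JobConf picks up core-site.xml / hdfs-site.xml from the classpath;
            // fs.default.name (fs.defaultFS on newer releases) must point at your NameNode.
            JobConf conf = new JobConf();
            // conf.set("fs.default.name", "hdfs://namenode:9000");  // hypothetical address

            FileSystem fs = FileSystem.get(conf);
            Path dest = new Path("/user/example/however.file");      // hypothetical target path
            try (FSDataOutputStream out = fs.create(dest)) {
                out.writeUTF("hello from an external client");       // regular Java I/O from here on
            }
            fs.close();
        }
    }

You can run this from any machine that can reach the NameNode and DataNodes; the Hadoop jars and configuration files just need to be on the classpath.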



Answer 3:

For the problem of loading the data I needed into HDFS, I chose to turn the problem around.

Instead of uploading the files to HDFS from the server where they resided, I wrote a Java Map/Reduce job in which the mapper reads each file from the file server (in this case via HTTPS) and writes it directly to HDFS (via the Java API).

The list of files to fetch is read from the job's input. An external script populates a file with that list, uploads it into HDFS (using hadoop dfs -put), and then starts the map/reduce job with a decent number of mappers.
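
A rough sketch of such a mapper, using the old org.apache.hadoop.mapred API of that era (the class name, the target directory, and the one-URL-per-input-line convention are assumptions for illustration, not details from the original answer):

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Each input line is an https URL; the mapper streams that file straight into HDFS.
    public class FetchToHdfsMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, NullWritable> {

        private JobConf conf;

        @Override
        public void configure(JobConf job) {
            this.conf = job;
        }

        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, NullWritable> output, Reporter reporter)
                throws IOException {
            String url = value.toString().trim();
            String name = url.substring(url.lastIndexOf('/') + 1);
            Path dest = new Path("/ingest/" + name);            // hypothetical target directory
            FileSystem fs = dest.getFileSystem(conf);

            try (InputStream in = new URL(url).openStream();
                 FSDataOutputStream out = fs.create(dest)) {
                IOUtils.copyBytes(in, out, 4096, false);        // stream from the file server into HDFS
            }
            output.collect(value, NullWritable.get());          // record which URL was ingested
        }
    }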

This gives me excellent transfer performance, since multiple files are read/written at the same time.

Maybe not the answer you were looking for, but hopefully helpful anyway :-).



Answer 4:

About two years after my last answer, there are now two new alternatives: Hoop/HttpFS and WebHDFS.

Regarding Hoop, it was first announced on Cloudera's blog and can be downloaded from a GitHub repository. I have managed to get this version to talk successfully to at least Hadoop 0.20.1; it can probably talk to slightly older versions as well.

If you're running Hadoop 0.23.1 (which at the time of writing is still not released), Hoop is instead part of Hadoop as its own component, HttpFS. This work was done as part of HDFS-2178. Hoop/HttpFS can act as a proxy not only to HDFS but also to other Hadoop-compatible filesystems such as Amazon S3.

Hoop/HttpFS runs as its own standalone service.

There's also WebHDFS, which runs as part of the NameNode and DataNode services. It also provides a REST API which, if I understand correctly, is compatible with the HttpFS API. WebHDFS is part of Hadoop 1.0, and one of its major features is that it provides data locality: when you make a read request, you are redirected to the WebHDFS component on the datanode where the data resides.
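
As an illustration of the kind of REST interaction this involves, here is a rough Java sketch of WebHDFS's two-step file creation, in which the NameNode first redirects the client to a DataNode. The host, path, and user name are placeholders, and 50070 is assumed to be the usual NameNode HTTP port on Hadoop 1.0; secured clusters would need authentication on top of this.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebHdfsPutExample {
        public static void main(String[] args) throws Exception {
            // Step 1: ask the NameNode where to write; it answers with a 307 redirect to a DataNode.
            URL nn = new URL("http://namenode:50070/webhdfs/v1/user/example/however.file"
                    + "?op=CREATE&user.name=example&overwrite=true");
            HttpURLConnection step1 = (HttpURLConnection) nn.openConnection();
            step1.setRequestMethod("PUT");
            step1.setInstanceFollowRedirects(false);         // we want the Location header, not an auto-follow
            step1.getResponseCode();                         // expect 307 (temporary redirect)
            String dataNodeUrl = step1.getHeaderField("Location");
            step1.disconnect();

            // Step 2: send the file contents to the DataNode URL from the redirect.
            HttpURLConnection step2 = (HttpURLConnection) new URL(dataNodeUrl).openConnection();
            step2.setRequestMethod("PUT");
            step2.setDoOutput(true);
            try (OutputStream out = step2.getOutputStream()) {
                out.write("hello from an external client".getBytes("UTF-8"));
            }
            System.out.println("DataNode responded: " + step2.getResponseCode());  // 201 Created on success
            step2.disconnect();
        }
    }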

Which component to choose depends a bit on your current setup and on what needs you have. If you need an HTTP REST interface to HDFS now and you're running a version that does not include WebHDFS, starting with Hoop from the GitHub repository seems like the easiest option. If you are running a version that includes WebHDFS, I would go for that, unless you need some of the features Hoop has that WebHDFS lacks (access to other filesystems, bandwidth limiting, etc.).



Answer 5:

It seems there is now a dedicated page for this at http://wiki.apache.org/hadoop/MountableHDFS:

These projects (enumerated below) allow HDFS to be mounted (on most flavors of Unix) as a standard file system using the mount command. Once mounted, the user can operate on an instance of hdfs using standard Unix utilities such as 'ls', 'cd', 'cp', 'mkdir', 'find', 'grep', or use standard Posix libraries like open, write, read, close from C, C++, Python, Ruby, Perl, Java, bash, etc.

Later it describes these projects:

  • contrib/fuse-dfs is built on fuse, some C glue, libhdfs and the hadoop-dev.jar
  • fuse-j-hdfs is built on fuse, fuse for java, and the hadoop-dev.jar
  • hdfs-fuse - a google code project is very similar to contrib/fuse-dfs
  • webdav - hdfs exposed as a webdav resource
  • mapR - contains a closed source hdfs compatible file system that supports read/write NFS access
  • HDFS NFS Proxy - exports HDFS as NFS without use of fuse. Supports Kerberos and re-orders writes so they are written to hdfs sequentially.

I haven't tried any of these, but I will update the answer soon, as I have the same need as the OP.



Answer 6:

You can now also try to use Talend, which includes components for Hadoop integration.



Answer 7:

You can try mounting HDFS on the machine where you are executing your code (call it machine_X); machine_X should have InfiniBand connectivity with the HDFS cluster. Check this out: https://wiki.apache.org/hadoop/MountableHDFS



Answer 8:

You can also use HadoopDrive (http://hadoopdrive.effisoft.eu). It's a Windows shell extension.



Tags: hadoop hdfs