Is it possible to save files in Hadoop without saving them in local file system?

Posted 2019-02-19 17:23

Question:

Is it possible to save files in Hadoop without saving them in the local file system first? I would like to do something like what is shown below, but save the file directly to HDFS. At the moment I save the files to a documents directory, and only afterwards can I put them into HDFS, for instance using hadoop fs -put.

from subprocess import run

from django.core.files.storage import FileSystemStorage
from rest_framework.generics import GenericAPIView


class DataUploadView(GenericAPIView):

    def post(self, request):
        myfile = request.FILES['photo']
        # save the upload to the local file system first...
        fs = FileSystemStorage(location='documents/')
        filename = fs.save(myfile.name, myfile)
        local_path = 'my/path/documents/' + str(filename)
        hdfs_path = '/user/user1/' + str(filename)
        # ...and only then copy it into HDFS
        run(['hadoop', 'fs', '-put', local_path, hdfs_path])

Answer 1:

Hadoop has REST APIs that allow you to create files via WebHDFS.

So you could implement the create call yourself against the WebHDFS REST API, using a Python library such as requests for the HTTP part. However, there are also several Python libraries that already support Hadoop/HDFS, either through the REST APIs or through the native RPC mechanism via libhdfs:

  • pydoop
  • hadoopy
  • snakebite
  • pywebhdfs
  • hdfscli
  • pyarrow

Just make sure you look for how to create a file, rather than having the Python library shell out to hdfs dfs -put or hadoop fs -put.
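
For example, a minimal sketch of the two-step WebHDFS CREATE flow using requests might look like the following. The NameNode address, HDFS path, and user name are assumptions; Hadoop 3.x serves WebHDFS on port 9870 by default, Hadoop 2.x on 50070.

import requests

# Assumed NameNode address; adjust the host and WebHDFS port for your cluster.
NAMENODE = "http://namenode.example.com:9870"


def webhdfs_create(hdfs_path, data, user="user1"):
    # Step 1: ask the NameNode to CREATE; it replies with a 307 redirect whose
    # Location header points at the DataNode that will accept the bytes.
    url = NAMENODE + "/webhdfs/v1" + hdfs_path
    params = {"op": "CREATE", "user.name": user, "overwrite": "true"}
    resp = requests.put(url, params=params, allow_redirects=False)
    resp.raise_for_status()
    datanode_url = resp.headers["Location"]
    # Step 2: send the file content to the DataNode; 201 Created on success.
    resp = requests.put(datanode_url, data=data)
    resp.raise_for_status()


# e.g. stream a Django upload straight through, without touching the local disk:
# webhdfs_create("/user/user1/" + myfile.name, myfile.chunks())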

See the following for more information:

  • pydoop vs hadoopy - hadoop python client
  • List all files in HDFS Python without pydoop
  • A Guide to Python Frameworks for Hadoop
  • Native Hadoop file system (HDFS) connectivity in Python
  • PyArrow
  • https://github.com/pywebhdfs/pywebhdfs
  • https://github.com/spotify/snakebite
  • https://crs4.github.io/pydoop/api_docs/hdfs_api.html
  • https://hdfscli.readthedocs.io/en/latest/
  • WebHDFS REST API:Create and Write to a File


Answer 2:

Here's how to download a file directly to HDFS with Pydoop:

import os
import requests
import pydoop.hdfs as hdfs


def dl_to_hdfs(url, hdfs_path):
    # Stream the HTTP response and write it chunk by chunk straight into HDFS,
    # so the file never has to be saved on the local file system.
    r = requests.get(url, stream=True)
    with hdfs.open(hdfs_path, 'w') as f:
        for chunk in r.iter_content(chunk_size=1024):
            f.write(chunk)


URL = "https://www.python.org/ftp/python/3.7.0/Python-3.7.0.tar.xz"
dl_to_hdfs(URL, os.path.basename(URL))

The above snippet works for a generic URL. If you already have the file as a Django UploadedFile, you can probably use its .chunks method to iterate through the data.
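
For instance, here is a sketch of the view from the question rewritten to stream the upload straight into HDFS with Pydoop, assuming Pydoop is installed and configured to reach the cluster; the HDFS target directory is taken from the question.

import pydoop.hdfs as hdfs
from rest_framework.generics import GenericAPIView
from rest_framework.response import Response


class DataUploadView(GenericAPIView):

    def post(self, request):
        myfile = request.FILES['photo']
        hdfs_path = '/user/user1/' + myfile.name
        # UploadedFile.chunks() yields the upload in memory-friendly pieces,
        # so the file goes into HDFS without ever touching the local disk.
        with hdfs.open(hdfs_path, 'w') as f:
            for chunk in myfile.chunks():
                f.write(chunk)
        return Response({'path': hdfs_path})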



Answer 3:

Python on your Linux machine can, by itself, only access local files; it cannot write directly into HDFS.

To save/put files directly into HDFS, you can use one of the options below:

  • Spark: use DStreams for streaming files (a minimal PySpark sketch follows this list)

  • Kafka: a matter of setting up a configuration file; best for streaming data

  • Flume: set up a configuration file; best for static files
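
For the Spark option, a minimal PySpark DStream sketch that watches a local landing directory and writes each micro-batch of new text files out to HDFS could look like the following. The directory paths and application name are assumptions, and binary uploads such as photos would need a different source than textFileStream.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dir-to-hdfs")      # assumed application name
ssc = StreamingContext(sc, batchDuration=10)  # poll for new files every 10 s

# Watch a local "landing" directory for newly arriving text files...
lines = ssc.textFileStream("file:///my/path/documents")
# ...and write each micro-batch out under an HDFS path prefix.
lines.saveAsTextFiles("hdfs:///user/user1/uploads/batch")

ssc.start()
ssc.awaitTermination()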