Is it possible to save files in Hadoop without saving them in the local file system first? I would like to do something like what is shown below, but save the file directly to HDFS. At the moment I save the files to the documents directory, and only then can I put them into HDFS, for instance with hadoop fs -put.
from subprocess import run
from django.core.files.storage import FileSystemStorage
from rest_framework.generics import GenericAPIView

class DataUploadView(GenericAPIView):
    def post(self, request):
        myfile = request.FILES['photo']
        # Save the upload to the local "documents/" directory first
        fs = FileSystemStorage(location='documents/')
        filename = fs.save(myfile.name, myfile)
        local_path = 'my/path/documents/' + str(myfile.name)
        hdfs_path = '/user/user1/' + str(myfile.name)
        # ...then copy it into HDFS (pass the command as a list and leave
        # shell=False; shell=True with a list silently drops the arguments)
        run(['hadoop', 'fs', '-put', local_path, hdfs_path])
Hadoop has REST APIs that allow you to create files via WebHDFS. So you could write your own create operation on top of the REST APIs, using a Python library such as requests to do the HTTP. However, there are also several Python libraries that support Hadoop/HDFS and that already use the REST APIs, or that use the RPC mechanism via libhdfs:
- pydoop
- hadoopy
- snakebite
- pywebhdfs
- hdfscli
- pyarrow
Just make sure you look for how to create a file, rather than having the Python library shell out to hdfs dfs -put or hadoop fs -put.
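For example, with the hdfscli package from the list above, an upload can be streamed straight into HDFS with no local copy. This is a minimal sketch, assuming a WebHDFS endpoint at namenode-host:9870 (the NameNode HTTP port is 50070 on Hadoop 2.x) and the user from the question; all of those are placeholders:

from hdfs import InsecureClient

def save_upload_to_hdfs(uploaded_file, hdfs_dir='/user/user1/'):
    # Endpoint and user are assumptions; adjust them for your cluster.
    client = InsecureClient('http://namenode-host:9870', user='user1')
    # client.write() creates the file on HDFS directly (no -put involved)
    with client.write(hdfs_dir + uploaded_file.name, overwrite=True) as writer:
        for chunk in uploaded_file.chunks():
            writer.write(chunk)

Inside the question's post() method you would call save_upload_to_hdfs(request.FILES['photo']) and skip FileSystemStorage entirely.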
See the following for more information:
- pydoop vs hadoopy - hadoop python client
- List all files in HDFS Python without pydoop
- A Guide to Python Frameworks for Hadoop
- Native Hadoop file system (HDFS) connectivity in Python
- PyArrow
- https://github.com/pywebhdfs/pywebhdfs
- https://github.com/spotify/snakebite
- https://crs4.github.io/pydoop/api_docs/hdfs_api.html
- https://hdfscli.readthedocs.io/en/latest/
- WebHDFS REST API: Create and Write to a File
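To illustrate the hand-rolled route: the create operation described in that last reference is a two-step exchange, in which the NameNode answers the first PUT with a 307 redirect to a DataNode and the data goes in the second PUT. A rough sketch with requests, with host, port, and user as placeholders:

import requests

def webhdfs_create(namenode, hdfs_path, data, user='user1'):
    # Step 1: ask the NameNode where to write. It replies with a 307
    # redirect to a DataNode, so don't follow redirects automatically.
    url = 'http://{}/webhdfs/v1{}'.format(namenode, hdfs_path)
    params = {'op': 'CREATE', 'user.name': user, 'overwrite': 'true'}
    r = requests.put(url, params=params, allow_redirects=False)
    r.raise_for_status()
    # Step 2: send the actual bytes to the DataNode URL from the redirect.
    r = requests.put(r.headers['Location'], data=data)
    r.raise_for_status()

# e.g.: webhdfs_create('namenode-host:9870', '/user/user1/photo.jpg', myfile.read())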
Here's how to download a file directly to HDFS with Pydoop:
import os
import requests
import pydoop.hdfs as hdfs

def dl_to_hdfs(url, hdfs_path):
    # Stream the download so the whole file never sits in memory
    r = requests.get(url, stream=True)
    with hdfs.open(hdfs_path, 'w') as f:
        for chunk in r.iter_content(chunk_size=1024):
            f.write(chunk)

URL = "https://www.python.org/ftp/python/3.7.0/Python-3.7.0.tar.xz"
dl_to_hdfs(URL, os.path.basename(URL))
The above snippet works for a generic URL. If you already have the file as a Django UploadedFile, you can probably use its .chunks() method to iterate through the data instead.
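Roughly, and untested, that adaptation of the same Pydoop write loop might look like this (the field name and HDFS path are taken from the question):

import pydoop.hdfs as hdfs

def upload_to_hdfs(uploaded_file, hdfs_path):
    # Same pattern as dl_to_hdfs above, but iterating over the Django
    # upload's chunks instead of an HTTP response.
    with hdfs.open(hdfs_path, 'w') as f:
        for chunk in uploaded_file.chunks():
            f.write(chunk)

# in the view: upload_to_hdfs(request.FILES['photo'], '/user/user1/' + request.FILES['photo'].name)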
Python is installed on your Linux machine and, out of the box, it can only access local files; it cannot reach files in HDFS without one of the client libraries or APIs described above. To save/put files directly into HDFS you can also use one of these:
- Spark: use a DStream (Spark Streaming) for streaming files.
- Kafka: mostly a matter of setting up a configuration file; best for streaming data.
- Flume: also driven by a configuration file; best for static files.
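If you take the Spark route, writing into HDFS is just a matter of addressing an hdfs:// path. A minimal sketch, shown as a plain batch job rather than a DStream for brevity; both paths are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('hdfs-write').getOrCreate()

# Read from the local file system and write the same records into HDFS.
df = spark.read.text('file:///my/path/documents/data.txt')
df.write.text('hdfs:///user/user1/data_in_hdfs')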