Here is my problem: I have a file in HDFS which can potentially be huge (too large to fit in memory).
What I would like to do is avoid loading this whole file into memory, and instead process it line by line like I would do with a regular file:
for line in open("myfile", "r"):
    # do some processing
I am looking for an easy way to get this done right without using external libraries. I could probably make it work with libpyhdfs or python-hdfs, but I'd like to avoid introducing new dependencies and untested libraries into the system if possible, especially since neither of these seems heavily maintained and both state that they shouldn't be used in production.
I was thinking of doing this with the standard "hadoop" command-line tools via the Python subprocess module, but I can't seem to do what I need: there is no command-line tool that would do my processing, and I would like to execute a Python function on every line in a streaming fashion.
Is there a way to apply Python functions as right operands of the pipes using the subprocess module? Or even better, open it like a file as a generator so I could process each line easily?
cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)
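To make the "open it like a file as a generator" idea concrete, here is a minimal sketch of what I have in mind. It is not a tested solution against a real cluster: the helper name `stream_lines` is hypothetical, and it assumes Python 3, UTF-8 text, and that the "hadoop" binary is on the PATH. The pipe's stdout is iterated lazily, so only one line should be held in memory at a time.

```python
import subprocess


def stream_lines(cmd):
    """Run cmd and yield its stdout one decoded line at a time.

    Hypothetical helper: assumes the output is UTF-8 text.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        # Iterating over the pipe reads lazily, line by line,
        # instead of buffering the whole output.
        for raw in proc.stdout:
            yield raw.decode("utf-8").rstrip("\r\n")
    finally:
        # Runs on normal exhaustion and when the caller stops early.
        proc.stdout.close()
        returncode = proc.wait()
    # Only reached when the output was read to the end.
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, cmd)


# Intended usage (requires a working Hadoop client, so left as a comment):
# for line in stream_lines(["hadoop", "fs", "-cat", "/path/to/myfile"]):
#     my_function(line)  # my_function is a placeholder for the processing
```

Raising on a non-zero exit code is a design choice here, so that a failed "hadoop fs -cat" is not mistaken for an empty file.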
If there is another way to achieve what I described above without using an external library, I'm open to that as well.
Thanks for any help!