Decompressing Hadoop Snappy File

Posted 2019-07-21 02:44

Question:

So I'm having some issues decompressing a Snappy file from HDFS. If I use hadoop fs -text I am able to uncompress and output the file just fine. However, if I use hadoop fs -copyToLocal and try to uncompress the file with python-snappy, I get

snappy.UncompressError: Error while decompressing: invalid input

My python program is very simple and looks like this:

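import snappy

# snappy_file is the path to the local copy of the file
with open(snappy_file, "rb") as input_file:  # binary mode: the data is compressed bytes
    data = input_file.read()
    uncompressed = snappy.uncompress(data)  # raw-mode decompression
    print(uncompressed)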

This fails miserably for me. So I tried another test: I took the output from hadoop fs -text and compressed it using the python-snappy library, then wrote it to a file. I was then able to read this file back in and uncompress it just fine.

AFAIK Snappy is backwards compatible between versions. My Python code is using the latest Snappy version and I'm guessing Hadoop is using an older one. Could this be the problem? Or is there something else I am missing here?

Answer 1:

Okay, well, I figured it out. It turns out I was using raw-mode decompression on a file that was compressed using Hadoop's framing format. Even when I tried the StreamDecompressor in 0.5.1, it still failed with a framing error. python-snappy 0.5.1 defaults to the new Snappy framing format and thus can't decompress Hadoop Snappy files.
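To make the difference concrete, here is a minimal sketch of unwrapping Hadoop-style block framing by hand. It rests on an assumption that is not stated in this post: that each block starts with a 4-byte big-endian uncompressed length, followed by one or more chunks that each carry a 4-byte big-endian compressed length and a raw Snappy payload. The function name hadoop_block_decompress is hypothetical, and the raw decompressor is passed in as a parameter (for example snappy.uncompress) so that the framing logic stays separate from the codec.

```python
import struct


def hadoop_block_decompress(data, raw_decompress):
    """Unwrap Hadoop-style block framing and decompress each chunk.

    Assumed layout (an assumption, not confirmed by the post):
      [4-byte big-endian uncompressed length of the block]
      then chunks of [4-byte big-endian compressed length][raw payload]
      until the block's uncompressed length has been produced.
    """
    out = []
    pos = 0
    while pos < len(data):
        # Header of the block: how many uncompressed bytes it should yield.
        (block_len,) = struct.unpack_from(">I", data, pos)
        pos += 4
        produced = 0
        while produced < block_len:
            # Header of the chunk: size of the compressed payload that follows.
            (chunk_len,) = struct.unpack_from(">I", data, pos)
            pos += 4
            chunk = raw_decompress(data[pos:pos + chunk_len])
            pos += chunk_len
            produced += len(chunk)
            out.append(chunk)
    return b"".join(out)
```

If the assumption about the layout holds, calling hadoop_block_decompress(data, snappy.uncompress) on the bytes read from the file would be the manual equivalent of what a Hadoop-aware decompressor does. Feeding the same bytes straight to a raw decompressor fails because the length headers are not valid Snappy data.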

Turns out that the master version, 0.5.2, has added support for the Hadoop framing format. Once I built this and imported it, I was able to decompress the file easily:

with open(snappy_file, "rb") as input_file:  # binary mode: the data is compressed bytes
    data = input_file.read()
    decompressor = snappy.hadoop_snappy.StreamDecompressor()
    uncompressed = decompressor.decompress(data)
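When it is unclear which container a file uses, one quick check is whether it begins with the stream identifier of the Snappy framing format. The sketch below assumes only that identifier (the bytes ff 06 00 00 followed by "sNaPpY"); the helper name looks_like_snappy_framing is hypothetical. A negative result does not prove the file is Hadoop-framed; it only rules out the standard framing format.

```python
# Stream identifier chunk that opens a file in the Snappy framing format.
SNAPPY_STREAM_IDENTIFIER = b"\xff\x06\x00\x00sNaPpY"


def looks_like_snappy_framing(data):
    """Return True if the buffer starts with the Snappy framing stream identifier."""
    return data.startswith(SNAPPY_STREAM_IDENTIFIER)
```

A file copied out of HDFS that fails this check, and also fails raw decompression, is a reasonable candidate for the Hadoop decompressor shown above.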

Now the only issue is that this version isn't available through pip yet, so I guess I'll have to wait for a release or keep using the build from source.
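Since the Hadoop support depends on which build is installed, a small probe can tell you whether the submodule is present before you rely on it. This is only a sketch: module_available is a hypothetical helper built on the standard library's importlib, and the module name snappy.hadoop_snappy is taken from the snippet above.

```python
import importlib.util


def module_available(dotted_name):
    """Return True if the named module can be located.

    Note: for a dotted name, the parent package is imported as a side effect.
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        # Raised when the parent package is missing or fails to import.
        return False


if module_available("snappy.hadoop_snappy"):
    print("Hadoop framing support is available in this python-snappy build")
else:
    print("No snappy.hadoop_snappy module found; a build from source may be needed")
```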