I have LZ4-compressed data in HDFS and I'm trying to decompress it in Apache Spark into an RDD. As far as I can tell, the only method in JavaSparkContext for reading data from HDFS is textFile, which only reads the data as it is stored in HDFS. I have come across articles on CompressionCodec, but all of them explain how to compress output when writing to HDFS, whereas I need to decompress what is already on HDFS. A minimal sketch of what I'm doing now is below.
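This is roughly my current attempt (the class name, app name, and HDFS path are just placeholders):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadLz4 {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("lz4-read-test");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Placeholder path to the LZ4-compressed files in HDFS
        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.lz4");

        // This just gives me the file contents as they sit in HDFS,
        // not the decompressed text I was hoping for.
        System.out.println(lines.first());

        sc.stop();
    }
}
```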
I am new to Spark, so I apologize in advance if I have missed something obvious or if my conceptual understanding is incorrect, but it would be great if someone could point me in the right direction.