Zip support in Apache Spark

Posted 2019-01-12 01:06

I have read about Spark's support for gzip-compressed input files here, and I wonder whether the same support exists for other kinds of compressed files, such as .zip files. So far I have tried processing a file compressed inside a zip archive, but Spark seems unable to read its contents.

I have taken a look at newAPIHadoopFile and newAPIHadoopRDD (the SparkContext methods for Hadoop's new input format API), but so far I have not been able to get anything working.

In addition, Spark supports creating a partition for every file under a specified folder, like in the example below:

SparkConf SpkCnf = new SparkConf().setAppName("SparkApp")
                                  .setMaster("local[4]");

JavaSparkContext Ctx = new JavaSparkContext(SpkCnf);

JavaRDD<String> FirstRDD = Ctx.textFile("C:\\input\\").cache();

Where C:\input\ points to a directory with multiple files.

If processing zipped files is possible, would it also be possible to pack all the files into a single compressed archive and still get one partition per file?

5 answers
做个烂人
#2 · 2019-01-12 01:25

Since Apache Spark uses Hadoop input formats, we can look at the Hadoop documentation on how to process zip files and see if there is something that works.

This site gives us an idea of how to do this (namely, we can use the ZipFileInputFormat). That being said, since zip files are not splittable (see this), your request to have a single compressed file isn't really well supported. Instead, if possible, it would be better to have a directory containing many separate zip files.

This question is similar to this other question; however, it additionally asks whether it would be possible to have a single zip file (which, since zip isn't a splittable format, isn't a good idea).
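If the data already arrives as one large archive, one way to get to "many separate zip files" is to repack it before handing it to Spark. The following is only a sketch in plain Python under that assumption; the function name split_zip is hypothetical and not part of Spark or Hadoop, and it holds everything in memory, so it suits small archives only.

```python
import io
import zipfile


def split_zip(raw_bytes):
    """Repack a multi-entry zip into one single-entry zip per file.

    Returns a dict mapping each entry name to the bytes of a new archive
    that contains only that entry. (Hypothetical helper, for illustration.)
    """
    parts = {}
    with zipfile.ZipFile(io.BytesIO(raw_bytes)) as src:
        for info in src.infolist():
            if info.is_dir():
                continue  # directories carry no data
            buf = io.BytesIO()
            with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as dst:
                dst.writestr(info.filename, src.read(info))
            parts[info.filename] = buf.getvalue()
    return parts
```

Each value can then be written out as its own .zip file in the input directory, so that every archive can be read as a separate record.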

骚年 ilove
#3 · 2019-01-12 01:28

Below is an example that searches a directory for .zip files and creates an RDD for each one using a custom FileInputFormat called ZipFileInputFormat and the newAPIHadoopFile API on the SparkContext. It then writes those files to an output directory.

// allzip, hadoopConf and ProcessFile come from the full example linked below.
allzip.foreach { x =>
  // One RDD per archive, with Text keys and BytesWritable values
  val zipFileRDD = sc.newAPIHadoopFile(
    x.getPath.toString,
    classOf[ZipFileInputFormat],
    classOf[Text],
    classOf[BytesWritable], hadoopConf)

  zipFileRDD.foreach { y =>
    ProcessFile(y._1.toString, y._2)
  }
}

https://github.com/alvinhenrick/apache-spark-examples/blob/master/src/main/scala/com/zip/example/Unzip.scala

The ZipFileInputFormat used in the example can be found here: https://github.com/cotdp/com-cotdp-hadoop/tree/master/src/main/java/com/cotdp/hadoop

仙女界的扛把子
#4 · 2019-01-12 01:29

Spark supports compressed files by default

According to Spark Programming Guide

All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").

This could be expanded with information about which compression formats Hadoop supports, which can be checked by finding all classes implementing CompressionCodec (docs)

name    | ext      | codec class
-------------------------------------------------------------
bzip2   | .bz2     | org.apache.hadoop.io.compress.BZip2Codec 
default | .deflate | org.apache.hadoop.io.compress.DefaultCodec 
deflate | .deflate | org.apache.hadoop.io.compress.DeflateCodec 
gzip    | .gz      | org.apache.hadoop.io.compress.GzipCodec 
lz4     | .lz4     | org.apache.hadoop.io.compress.Lz4Codec 
snappy  | .snappy  | org.apache.hadoop.io.compress.SnappyCodec

Source : List the available hadoop codecs

So the above formats, and many more, can be read simply by calling:

sc.textFile(path)
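Read as a lookup by file extension, the table above also shows why a .zip path falls through. The sketch below is plain Python and only mirrors the table; HADOOP_CODECS and codecs_for are hypothetical names, not a Hadoop or Spark API, and the actual set of codecs depends on the Hadoop installation.

```python
import os

# Extension -> codec classes, copied from the table above.
# Two of the listed codecs share the .deflate extension.
HADOOP_CODECS = {
    ".bz2": ["org.apache.hadoop.io.compress.BZip2Codec"],
    ".deflate": [
        "org.apache.hadoop.io.compress.DefaultCodec",
        "org.apache.hadoop.io.compress.DeflateCodec",
    ],
    ".gz": ["org.apache.hadoop.io.compress.GzipCodec"],
    ".lz4": ["org.apache.hadoop.io.compress.Lz4Codec"],
    ".snappy": ["org.apache.hadoop.io.compress.SnappyCodec"],
}


def codecs_for(path):
    """Return the codec classes listed for the path's extension, or [] if none."""
    ext = os.path.splitext(path)[1].lower()
    return HADOOP_CODECS.get(ext, [])
```

For a path such as "/data/part-0000.gz" this returns the gzip codec, while "archive.zip" yields an empty list, which matches the observation that zip needs separate handling.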

Reading zip files in Spark

Unfortunately, zip is not on the supported list by default.

I have found a great article, Hadoop: Processing ZIP files in Map/Reduce, and some answers (example) explaining how to use the imported ZipFileInputFormat together with the sc.newAPIHadoopFile API. But this did not work for me.

My solution

Without any external dependencies, you can load your file with sc.binaryFiles and then decompress the PortableDataStream to read its content. This is the approach I have chosen.

import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream
import org.apache.spark.SparkContext
import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD

implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

    def readFile(path: String,
                 minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {

      if (path.endsWith(".zip")) {
        sc.binaryFiles(path, minPartitions)
          .flatMap { case (name: String, content: PortableDataStream) =>
            val zis = new ZipInputStream(content.open)
            // this solution works only for single file in the zip
            val entry = zis.getNextEntry
            val br = new BufferedReader(new InputStreamReader(zis))
            Stream.continually(br.readLine()).takeWhile(_ != null)
          }
      } else {
        sc.textFile(path, minPartitions)
      }
    }
  }

To use this implicit class, import it and call the readFile method on SparkContext:

import com.github.atais.spark.Implicits.ZipSparkContext
sc.readFile(path)

And the implicit class will load your zip file properly and return an RDD[String], as textFile does for the other formats.

Note: This only works for a single file in the zip archive!
For zip archives containing multiple files, check this answer: https://stackoverflow.com/a/45958458/1549135
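For reference, the multi-file case can be sketched in Python with the standard zipfile module. This is not the code from the linked answer; iter_zip_lines is a hypothetical helper that walks every entry of an archive held in memory and yields (entry name, line) pairs, assuming the entries are UTF-8 text.

```python
import io
import zipfile


def iter_zip_lines(raw_bytes, encoding="utf-8"):
    """Yield (entry_name, line) for every line of every file in the archive."""
    with zipfile.ZipFile(io.BytesIO(raw_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries
            with zf.open(info) as fh:
                for line in io.TextIOWrapper(fh, encoding=encoding):
                    yield info.filename, line.rstrip("\r\n")


# With an existing SparkContext `sc` (assumed, not created here), PySpark's
# binaryFiles yields (path, bytes) pairs, so the helper plugs into flatMap:
# lines = sc.binaryFiles(path).flatMap(lambda kv: iter_zip_lines(kv[1]))
```

Each archive is still decompressed by a single task, so this does not make the zip format splittable; it only removes the single-entry limitation.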

贪生不怕死
#5 · 2019-01-12 01:37

You can use sc.binaryFiles to open the zip file in binary format, then unzip it into text format. Unfortunately, zip files are not splittable, so you need to wait for the decompression to finish and then perhaps repartition (shuffle) to balance the data across partitions.

Here is an example in Python. More info is in http://gregwiki.duckdns.org/index.php/2016/04/11/read-zip-file-in-spark/

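import io
import zipfile

# HDFS_path and data_path are assumed to be defined elsewhere
file_RDD = sc.binaryFiles(HDFS_path + data_path)  # RDD of (path, bytes)

def Zip_open(binary_stream_string):
    # Treat the raw bytes as an in-memory zip file; return None if they are not a valid archive
    try:
        pseudo_file = io.BytesIO(binary_stream_string)
        zf = zipfile.ZipFile(pseudo_file)
        return zf
    except zipfile.BadZipFile:
        return None

def read_zip_lines(zipfile_object):
    # Read all lines (as bytes) of the archive member named 'diff.txt'
    file_iter = zipfile_object.open('diff.txt')
    data = file_iter.readlines()
    return data

My_RDD = file_RDD.map(lambda kv: (kv[0], Zip_open(kv[1])))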
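The two helpers above can also be folded into one function that goes straight from a (path, bytes) pair to decoded lines and tolerates broken archives. This is a sketch, not the code from the linked post: zip_member_lines is a hypothetical name, the member name 'diff.txt' is taken from the example above, and the partition count of 8 in the usage comment is arbitrary.

```python
import io
import zipfile


def zip_member_lines(kv, member="diff.txt", encoding="utf-8"):
    """Return (path, line) pairs for one named member of a zip archive.

    Returns an empty list when the bytes are not a valid zip file or the
    archive does not contain the requested member.
    """
    path, raw = kv
    try:
        with zipfile.ZipFile(io.BytesIO(raw)) as zf:
            with zf.open(member) as fh:
                return [(path, line.decode(encoding).rstrip("\r\n")) for line in fh]
    except (zipfile.BadZipFile, KeyError):
        return []


# With an existing SparkContext `sc` (assumed, not created here):
# lines = sc.binaryFiles(HDFS_path + data_path).flatMap(zip_member_lines).repartition(8)
```

Returning an empty list instead of None means flatMap silently drops unreadable archives, and the repartition call is where the shuffle mentioned above would rebalance the lines.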
手持菜刀,她持情操
#6 · 2019-01-12 01:47

You can use sc.binaryFiles to read the zip archive as a binary file:

val rdd = sc.binaryFiles(path).map {
    case (name: String, content: PortableDataStream) => new ZipInputStream(content.open)
}  //=> RDD[ZipInputStream]

And then you can map each ZipInputStream to its lines. ZipInputStream is not serializable, so read it on the executors rather than fetching it to the driver with rdd.first:

val res = rdd.flatMap { zis =>
    val entry = zis.getNextEntry  // position the stream at the first entry
    val br = new BufferedReader(new InputStreamReader(zis, "UTF-8"))
    Stream.continually(br.readLine()).takeWhile(_ != null).toList
}  //=> RDD[String]

But the problem remains that the zip file is not splittable.
