Importing an lzo file into Java Spark as a Dataset

Posted 2019-07-28 02:08

I have some data in tsv format compressed using lzo. Now I would like to use this data in a Java Spark program.

At the moment, I am able to decompress the files and then import them in Java as text files using

    SparkSession spark = SparkSession.builder()
            .master("local[2]")
            .appName("MyName")
            .getOrCreate();

    Dataset<Row> input = spark.read()
            .option("sep", "\t")
            .csv(args[0]);

    input.show(5);   // visually check if data were imported correctly

where I have passed the path to the decompressed file as the first argument. If I pass the lzo file instead, the output of show is illegible garbage.

Is there a way to make it work? I use IntelliJ as my IDE and the project is set up with Maven.

1 Answer

Lonely孤独者°
#2 · 2019-07-28 02:43

I found a solution. It has two parts: installing the hadoop-lzo package and configuring it. Once that is done, the code stays the same as in the question, provided you are OK with the lzo file being read into a single partition.
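A side note on the single-partition point: a plain .lzo file is not splittable (unless an index file has been built for it, e.g. with hadoop-lzo's indexer), so Spark reads it into one partition. If that is a bottleneck, you can repartition right after loading. A minimal sketch, assuming the hadoop-lzo setup described below is in place; the path and partition count are made-up examples:

```java
// Read the LZO-compressed TSV; without an index it arrives in one partition.
Dataset<Row> input = spark.read()
        .option("sep", "\t")
        .csv("data/myfile.tsv.lzo");   // hypothetical path

// Spread the rows across partitions before any heavy processing.
Dataset<Row> spread = input.repartition(8);   // 8 is an arbitrary example
```

Repartitioning triggers a shuffle, so it only pays off when the downstream work is expensive enough to benefit from the added parallelism.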

In the following I will explain how to do it for a Maven project set up in IntelliJ.

  • Installing the package hadoop-lzo: modify the pom.xml file in your Maven project folder so that it contains the following excerpt:

    <repositories>
        <repository>
            <id>twitter-twttr</id>
            <url>http://maven.twttr.com</url>
        </repository>
    </repositories>
    
    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>
    
    <dependencies>
    
        <dependency>
            <!-- Apache Spark main library -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>
    
        <dependency>
            <!-- Packages for datasets and dataframes -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>
    
        <!-- https://mvnrepository.com/artifact/com.hadoop.gplcompression/hadoop-lzo -->
        <dependency>
            <groupId>com.hadoop.gplcompression</groupId>
            <artifactId>hadoop-lzo</artifactId>
            <version>0.4.20</version>
        </dependency>
    
    </dependencies>
    

This activates the Twitter Maven repository, which hosts the hadoop-lzo package, and makes hadoop-lzo available to the project.

  • The second step is to create a core-site.xml file that tells Hadoop you have installed a new codec. It must end up on the classpath: I put it under src/main/resources/core-site.xml and marked the folder as a resource (right-click the folder in the IntelliJ Project panel -> Mark Directory as -> Resources Root). The core-site.xml file should contain:

    <configuration>
        <property>
            <name>io.compression.codecs</name>
            <value>org.apache.hadoop.io.compress.DefaultCodec,
                com.hadoop.compression.lzo.LzoCodec,
                com.hadoop.compression.lzo.LzopCodec,
                org.apache.hadoop.io.compress.GzipCodec,
                org.apache.hadoop.io.compress.BZip2Codec</value>
        </property>
        <property>
            <name>io.compression.codec.lzo.class</name>
            <value>com.hadoop.compression.lzo.LzoCodec</value>
        </property>
    </configuration>
    

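If you would rather not ship a core-site.xml, the same two settings can, as far as I know, also be applied programmatically through the SparkContext's Hadoop configuration before reading. A sketch under that assumption:

```java
// Register the LZO codecs on the Hadoop configuration directly,
// as an alternative to placing core-site.xml on the classpath.
SparkSession spark = SparkSession.builder()
        .master("local[2]")
        .appName("MyName")
        .getOrCreate();

org.apache.hadoop.conf.Configuration hadoopConf =
        spark.sparkContext().hadoopConfiguration();
hadoopConf.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.DefaultCodec,"
        + "com.hadoop.compression.lzo.LzoCodec,"
        + "com.hadoop.compression.lzo.LzopCodec,"
        + "org.apache.hadoop.io.compress.GzipCodec,"
        + "org.apache.hadoop.io.compress.BZip2Codec");
hadoopConf.set("io.compression.codec.lzo.class",
        "com.hadoop.compression.lzo.LzoCodec");

Dataset<Row> input = spark.read()
        .option("sep", "\t")
        .csv(args[0]);   // path to the .lzo file
```

Either way, the hadoop-lzo dependency from the pom.xml above still has to be on the classpath so the codec classes can be found.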
And that's it! Run your program again and it should work!
