How to write dataset object to excel in spark java

Published 2019-04-02 05:15

Question:

I am reading an Excel file using the com.crealytics.spark.excel package. Below is the code to read an Excel file in Spark Java.

    Dataset<Row> SourcePropertSet = sqlContext.read()
               .format("com.crealytics.spark.excel")
               .option("location", "D:\\5Kto10K.xlsx")
               .option("useHeader", "true")
               .option("treatEmptyValuesAsNulls", "true")
               .option("inferSchema", "true")
               .option("addColorColumns", "false")
               .load();

I then tried the same (com.crealytics.spark.excel) package to write the Dataset object to an Excel file in Spark Java:

    SourcePropertSet.write()
          .format("com.crealytics.spark.excel")
          .option("useHeader", "true")
          .option("treatEmptyValuesAsNulls", "true")
          .option("inferSchema", "true")
          .option("addColorColumns", "false").save("D:\\resultset.xlsx");

But I am getting the error below.

java.lang.RuntimeException: com.crealytics.spark.excel.DefaultSource does not allow create table as select.

I also tried the org.zuinnote.spark.office.excel package; below is the code for that:

    SourcePropertSet.write()
             .format("org.zuinnote.spark.office.excel")
             .option("write.locale.bcp47", "de") 
             .save("D:\\result");

I have added the following dependencies in my pom.xml:

    <dependency>
        <groupId>com.github.zuinnote</groupId>
        <artifactId>hadoopoffice-fileformat</artifactId>
        <version>1.0.0</version>
    </dependency>
    <dependency>
        <groupId>com.github.zuinnote</groupId>
        <artifactId>spark-hadoopoffice-ds_2.11</artifactId>
        <version>1.0.3</version>
    </dependency>

But I am getting the error below.

java.lang.IllegalAccessError: tried to access method org.zuinnote.hadoop.office.format.mapreduce.ExcelFileOutputFormat.getSuffix(Ljava/lang/String;)Ljava/lang/String; from class org.zuinnote.spark.office.excel.ExcelOutputWriterFactory

Please help me write a Dataset object to an Excel file in Spark Java.

Answer 1:

It looks like the library you chose, com.crealytics.spark.excel, does not have any code related to writing Excel files. Underneath, it uses Apache POI for reading Excel files; there are also a few examples.

The good news is that CSV is a valid Excel file, so you can use spark-csv to write it. You need to change your code like this:

    SourcePropertSet.write()
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .save("D:\\resultset.csv");

Keep in mind that Spark writes one output file per partition, so you may want to call .repartition(1) first to get exactly one result file.
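Putting the read and the single-file write together, a minimal standalone sketch could look like the following (assuming the spark-csv dependency is on the classpath; the class name, local master, and file paths are illustrative, not from the question):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WriteSingleCsv {
    public static void main(String[] args) {
        // local session for illustration; in the question's setup an
        // existing sqlContext would be used instead
        SparkSession spark = SparkSession.builder()
                .appName("WriteSingleCsv")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> df = spark.read()
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .load("D:\\5Kto10K.csv"); // illustrative input path

        // Spark emits one part file per partition, so collapse to a
        // single partition to get exactly one CSV in the output folder
        df.repartition(1)
          .write()
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .save("D:\\resultset.csv"); // creates a directory with one part-* file

        spark.stop();
    }
}
```

Note that .save() produces a directory containing a single part-* file rather than a bare resultset.csv; the file can be renamed afterwards if needed.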

Answer 2:

The error you face when writing comes from an old version of the HadoopOffice library. Please make sure that you have only version 1.0.3 (or better, 1.0.4) as a dependency. Can you provide your build file? The following should work:

    SourcePropertSet.write()
        .format("org.zuinnote.spark.office.excel")
        .option("spark.write.useHeader", true)
        .option("write.locale.bcp47", "us")
        .save("D:\\result");
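One way to satisfy the "only one version" advice (an assumption on my part, not spelled out by the answerer) is to drop the explicit, older hadoopoffice-fileformat entry from the question's pom.xml and let the data source pull in the matching version transitively:

```xml
<!-- sketch: declare only the Spark data source; spark-hadoopoffice-ds
     brings in the matching hadoopoffice-fileformat itself, avoiding the
     1.0.0 / 1.0.3 mix that causes the IllegalAccessError -->
<dependency>
    <groupId>com.github.zuinnote</groupId>
    <artifactId>spark-hadoopoffice-ds_2.11</artifactId>
    <version>1.0.4</version>
</dependency>
```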

Version 1.0.4 of the Spark2 data source for HadoopOffice also supports inferring the schema when reading:

 Dataset<Row> SourcePropertSet = sqlContext.read()
               .format("org.zuinnote.spark.office.excel")
               .option("spark.read.useHeader", "true")
               .option("spark.read.simpleMode", "true")
               .load("D:\\5Kto10K.xlsx");

Please note that it is not recommended to mix different Excel data sources based on POI in one application.

More information here: https://github.com/ZuInnoTe/spark-hadoopoffice-ds