How to read multiple Excel files and concatenate them into one Apache Spark DataFrame

Published 2019-02-15 16:07

Question:

Recently I wanted to do the Spark Machine Learning Lab from Spark Summit 2016. The training video is here and the exported notebook is available here.

The dataset used in the lab can be downloaded from the UCI Machine Learning Repository. It contains a set of readings from various sensors in a gas-fired power generation plant. The format is an xlsx file with five sheets.

To use the data in the lab I needed to read all the sheets from the Excel file and concatenate them into one Spark DataFrame. The training uses a Databricks notebook, but I was using IntelliJ IDEA with Scala and evaluating the code in the console.

The first step was to save all the Excel sheets into separate xlsx files named sheet1.xlsx, sheet2.xlsx etc. and put them into a sheets directory.

How to read all the Excel files and concatenate them into one Apache Spark DataFrame?

Answer 1:

For this I used the spark-excel package. It can be added to the build.sbt file as: libraryDependencies += "com.crealytics" %% "spark-excel" % "0.8.2"
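For reference, a minimal build.sbt for such a setup might look like the sketch below; the Scala and Spark versions are assumptions (spark-excel 0.8.2 is built for Scala 2.11), so adjust them to your environment:

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.1.0",  // assumed Spark version
  "org.apache.spark" %% "spark-sql"   % "2.1.0",
  "com.crealytics"   %% "spark-excel" % "0.8.2"
)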

The code to execute in the IntelliJ IDEA Scala Console was:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SparkSession, DataFrame}
import java.io.File

val conf = new SparkConf().setAppName("Excel to DataFrame").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")

val spark = SparkSession.builder().getOrCreate()

// Function to read one xlsx file into a DataFrame using spark-excel.
// The "trailing dot" formatting lets the whole block be sent to the IntelliJ Scala Console at once.
def readExcel(file: String): DataFrame = spark.read.
  format("com.crealytics.spark.excel").
  option("location", file).
  option("useHeader", "true").
  option("treatEmptyValuesAsNulls", "true").
  option("inferSchema", "true").
  option("addColorColumns", "False").
  load()

val dir = new File("./data/CCPP/sheets")
val excelFiles = dir.listFiles.sorted.map(f => f.toString)  // Array[String]

val dfs = excelFiles.map(f => readExcel(f))  // Array[DataFrame]
val ppdf = dfs.reduce(_.union(_))  // DataFrame 

ppdf.count()  // res3: Long = 47840
ppdf.show(5)

Console output:

+-----+-----+-------+-----+------+
|   AT|    V|     AP|   RH|    PE|
+-----+-----+-------+-----+------+
|14.96|41.76|1024.07|73.17|463.26|
|25.18|62.96|1020.04|59.08|444.37|
| 5.11| 39.4|1012.16|92.14|488.56|
|20.86|57.32|1010.24|76.64|446.48|
|10.82| 37.5|1009.23|96.62| 473.9|
+-----+-----+-------+-----+------+
only showing top 5 rows 
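As a side note, spark-excel also accepts a sheetName option (used in the next answer), so the manual step of splitting the workbook into separate files can in principle be skipped by reading each sheet directly from the original file. A sketch, where the workbook path and the default sheet names Sheet1 to Sheet5 are assumptions to adjust to the actual file:

// Read a single named sheet from the workbook into a DataFrame.
def readSheet(file: String, sheet: String): DataFrame = spark.read.
  format("com.crealytics.spark.excel").
  option("location", file).
  option("sheetName", sheet).
  option("useHeader", "true").
  option("inferSchema", "true").
  load()

val xlsx = "./data/CCPP/powerplant.xlsx"  // assumed path to the original workbook
val ppdfAll = (1 to 5).map(i => readSheet(xlsx, s"Sheet$i")).reduce(_.union(_))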


Answer 2:

We need the spark-excel library for this; it can be obtained from

https://github.com/crealytics/spark-excel#scala-api

  1. Clone the git project from the above GitHub link and build it using "sbt package".
  2. Use Spark 2 to run the spark-shell:

spark-shell --driver-class-path ./spark-excel_2.11-0.8.3.jar --master yarn

  3. Import the necessary packages:

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
// In a Spark 2 shell a SparkSession is already available as `spark`;
// a SQLContext is created here from the SparkContext `sc`.
val sqlContext = new SQLContext(sc)

  4. Set the Excel document path:

val document = "path to excel doc"

  5. Execute the function below to create a DataFrame from it:
val dataDF = sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Sheet Name")            // name of the sheet to read
  .option("useHeader", "true")                  // treat the first row as a header
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", "false")               // all columns are read as strings
  .option("location", document)                 // redundant with load(document) below
  .option("addColorColumns", "false")
  .load(document)

That's all! Now you can perform DataFrame operations on the dataDF object.
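For example, a couple of quick checks on the result (note that with inferSchema set to "false", every column is read back as a string):

dataDF.printSchema()  // all columns will be string-typed unless inferSchema is enabled
dataDF.show(5)
dataDF.count()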