I am learning Spark and Scala. I am well versed in Java, but not so much in Scala. I am going through a Spark tutorial and came across the following line of code, which has not been explained:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
(sc is the SparkContext instance.)
I know the concepts behind Scala implicits (at least I think I do). Could somebody explain what exactly the import statement above does? What implicits are bound to the sqlContext instance when it is instantiated, and how? Are these implicits defined inside the SQLContext class?
EDIT
The following seems to work for me as well (fresh code):
val sqlc = new SQLContext(sc)
import sqlContext.implicits._
In the code just above, what exactly is sqlContext and where is it defined?
From ScalaDoc:
sqlContext.implicits contains "(Scala-specific) Implicit methods available in Scala for converting common Scala objects into DataFrames."
This is also explained in the Spark programming guide:
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
For example, in the code below, .toDF() won't work unless you import sqlContext.implicits._:
import scala.io.Source  // needed for Source.fromFile

// Airport is assumed here to be a case class with seven String fields, one per CSV column
val airports = sc.makeRDD(Source.fromFile(airportsPath).getLines().drop(1).toSeq, 1)
  .map(s => s.replaceAll("\"", "").split(","))
  .map(a => Airport(a(0), a(1), a(2), a(3), a(4), a(5), a(6)))
  .toDF()
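Without the import, the same chain fails at compile time, roughly like this (a sketch; the exact message depends on your Spark version):

// with import sqlContext.implicits._ missing:
// error: value toDF is not a member of org.apache.spark.rdd.RDD[Airport]
//   .toDF()
//    ^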
What implicits are bound to the sqlContext instance when it is instantiated and how? Are these implicits defined inside the SQLContext class?
Yes, they are defined in the object implicits inside the SQLContext class, which extends SQLImplicits (see SQLImplicits.scala). It looks like there are two kinds of implicits defined there:
- An RDD-to-DataFrameHolder conversion, which enables the rdd.toDF() call mentioned above (a sketch of the mechanism follows this list).
- Various instances of Encoder, which are "Used to convert a JVM object of type T to and from the internal Spark SQL representation."
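To make the mechanism concrete, here is a minimal, self-contained sketch (not Spark's actual source) of the pattern: implicits is an object-valued member, so import ctx.implicits._ pulls its implicit members into scope from that particular instance.

import scala.language.implicitConversions  // silences the implicit-conversion feature warning

// Stand-in for Spark's DataFrameHolder: wraps a value and exposes toDF()
class DataFrameHolder(rows: Seq[String]) {
  def toDF(): String = rows.mkString("|")  // stand-in for a real DataFrame
}

class MiniSQLContext {
  // member object, analogous to SQLContext's "object implicits extends SQLImplicits"
  object implicits {
    // implicit conversion: any Seq[String] gains toDF() via DataFrameHolder
    implicit def seqToHolder(s: Seq[String]): DataFrameHolder = new DataFrameHolder(s)
  }
}

object Demo extends App {
  val ctx = new MiniSQLContext
  import ctx.implicits._           // brings seqToHolder into implicit scope
  println(Seq("a", "b").toDF())    // prints "a|b": toDF() resolves via the conversion
}

The encoders work the same way: with import sqlContext.implicits._ in scope, a call like sqlContext.createDataset(Seq(1, 2, 3)) (Spark 1.6+) finds the implicit Encoder[Int] it needs.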