I have a Scala class that I define like so:
import org.apache.spark.{SparkConf, SparkContext}
object TestObject extends App {
  val FAMILY = "data".toUpperCase

  override def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf())
    sc.parallelize(1 to 10)
      .map(getData)
      .saveAsTextFile("my_output")
  }

  def getData(i: Int) = {
    (i, FAMILY, "data".toUpperCase)
  }
}
I submit it to a YARN cluster like so:
HADOOP_CONF_DIR=/etc/hadoop/conf spark-submit \
    --conf spark.hadoop.validateOutputSpecs=false \
    --conf spark.yarn.jar=hdfs:/apps/local/spark-assembly-1.2.1-hadoop2.4.0.jar \
    --deploy-mode=cluster \
    --master=yarn \
    --class=TestObject \
    target/scala-2.11/myjar-assembly-1.1.jar
Unexpectedly, the output looks like the following, indicating that the getData method can't see the value of FAMILY:
(1,null,DATA)
(2,null,DATA)
(3,null,DATA)
(4,null,DATA)
(5,null,DATA)
(6,null,DATA)
(7,null,DATA)
(8,null,DATA)
(9,null,DATA)
(10,null,DATA)
What do I need to understand about fields, scoping, visibility, Spark submission, objects, and singletons to understand why this is happening? And what should I be doing instead, if I basically want variables defined as "constants" to be visible to the getData method?
Figured it out. It's the App trait causing trouble. It manifests even in this simple class:
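(A sketch along those lines, assuming a val plus an overridden main that prints it; the MinimalApp name is just for illustration.)

object MinimalApp extends App {
  val FAMILY = "data".toUpperCase

  // With App, the object body (including the FAMILY initializer) is wrapped
  // by DelayedInit and only runs from App's own main. Overriding main means
  // that body never executes, so FAMILY is still null here.
  override def main(args: Array[String]) {
    println(FAMILY)   // prints: null
  }
}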
Apparently App inherits from DelayedInit, which means that when main() runs, FAMILY hasn't been initialized. Exactly what I don't want, so I'm going to stop using App.
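For the Spark job, that change might look roughly like this (a sketch: the same body as the question's code, just a plain object with an ordinary main instead of App):

import org.apache.spark.{SparkConf, SparkContext}

object TestObject {
  // A plain object has no DelayedInit: this initializer runs when the
  // singleton is first loaded, on the driver and again on each executor,
  // so getData sees "DATA" everywhere.
  val FAMILY = "data".toUpperCase

  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf())
    sc.parallelize(1 to 10)
      .map(getData)
      .saveAsTextFile("my_output")
  }

  def getData(i: Int) = {
    (i, FAMILY, "data".toUpperCase)
  }
}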
I might be missing something, but I don't think you should be defining a main method. When you extend App, you inherit a main, and you should not override it, since that is what actually invokes the code in your App. For example, the simple class in your answer should be written like this:
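(A sketch, reusing the MinimalApp name from the earlier example; the println is only there to show that FAMILY is initialized by the time App's inherited main runs the body.)

object MinimalApp extends App {
  // No main override: App's inherited main executes this body,
  // so FAMILY is initialized before the println runs.
  val FAMILY = "data".toUpperCase
  println(FAMILY)   // prints: DATA
}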