I have the following code:
```scala
val blueCount = sc.accumulator[Long](0)
val output = input.map { data =>
  for (value <- data.getValues()) {
    if (value.getEnum() == DataEnum.BLUE) {
      blueCount += 1
      println("Enum = BLUE : " + value.toString)
    }
  }
  data
}.persist(StorageLevel.MEMORY_ONLY_SER)
output.saveAsTextFile("myOutput")
```
Afterwards, blueCount is not zero, but I get no println() output! Am I missing anything here? Thanks!
I was able to work around it by making a utility function:
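Something along these lines (the exact utility isn't shown above, so this is a minimal sketch; the object name `PrintUtility` is an assumption):

```scala
// Sketch of the workaround (names assumed, not the exact original).
// A Scala object is a singleton that is initialized once per JVM, so
// each executor creates its own copy the first time a task closure
// references it.
object PrintUtility {
  def print(msg: String): Unit = println(msg)
}
```

Inside the `map` you would then call `PrintUtility.print("Enum = BLUE : " + value.toString)`. Note that this still writes to the stdout of whichever JVM executes it.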
This is a conceptual question...
Imagine you have a big cluster composed of many workers, let's say `n` workers, and each worker stores a partition of an `RDD` or `DataFrame`. Now imagine you start a `map` task across that data, and inside that `map` you have a `print` statement. First of all: where should that line be printed out? On which node? Which partition gets priority? If all the workers run in parallel, who prints first? How would such a print queue even be coordinated?

Those are too many questions, so the designers/maintainers of apache-spark logically decided that the output of any `print` statement inside a distributed operation (this includes code that updates `accumulators` or uses `broadcast` variables) is simply written to each worker's own stdout, rather than being shipped back to the driver. That is why `blueCount` still gets updated (accumulator values are merged back on the driver) while the `println` output never appears on your console.

This also makes sense because Spark is a framework designed for very large datasets. While printing can be useful for testing and debugging, you wouldn't want to print every line of a `DataFrame` or `RDD`, because they are built to hold millions or even billions of rows! So why deal with these complicated questions when you wouldn't even want to print in the first place?
To prove this to yourself, you can run this Scala code, for example:
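(A minimal sketch of the kind of demonstration meant here, assuming a spark-shell session, where `sc` is already available; the values are illustrative:)

```scala
// Distribute a small collection across 4 partitions.
val rdd = sc.parallelize(1 to 10, 4)

// This closure runs on the executors. On a real cluster, each worker
// writes to its OWN stdout log (visible per-executor in the Spark UI),
// so nothing shows up on the driver console. In local[*] mode you WOULD
// see the output, because driver and executor share a single JVM.
rdd.foreach(x => println("value = " + x))

// To actually see values on the driver, collect the data back first:
rdd.collect().foreach(x => println("value = " + x))
```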