The following code is from the quick start guide of Apache Spark. Can somebody explain to me what the "line" variable is and where it comes from?
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
Also, how do values get passed into a and b?
Link to the QSG http://spark.apache.org/docs/latest/quick-start.html
First, according to your link, textFile is created such that it is an RDD[String], meaning it is a resilient distributed dataset of type String. The API for accessing it is very similar to that of regular Scala collections.

So now, what does this map do? Imagine you have a list of Strings and want to convert it into a list of Ints, representing the length of each String. The map method expects a function from String => Int. With that function, each element of the list is transformed. For instance, a list whose elements have lengths 2, 3, and 1 becomes the list of Ints List( 2, 3, 1 ) — the value of intlist in the example.
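A minimal sketch of that example (the variable names stringlist/intlist follow the answer; the concrete strings are assumptions chosen so the result matches):

```scala
// A list of Strings mapped to the length of each String. The input
// strings are made up so the result matches the answer's List(2, 3, 1).
val stringlist = List("ab", "abc", "a")
val intlist = stringlist.map(x => x.length)
// intlist == List(2, 3, 1)
```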
Here, we have created an anonymous function from String => Int, namely x => x.length. One can even write the function more explicitly, as a named value; if you write it out explicitly like that, it is absolutely evident that stringLength is a function from String to Int.
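A sketch of that explicit version (the exact original declaration was lost; the name stringLength comes from the answer, the rest is an assumption):

```scala
// Explicit, named version of the anonymous function x => x.length.
// The type annotation makes it evident this is a String => Int.
val stringLength: String => Int = (x: String) => x.length

List("ab", "abc", "a").map(stringLength)  // List(2, 3, 1)
```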
Remark: In general, map is what makes up a so-called Functor. While you provide a function from A => B, map of the functor (here, List) allows you to use that function to go from List[A] => List[B]. This is called lifting.

Answers to your questions
As mentioned above, line is the input parameter of the function
line => line.split(" ").size
More explicitly:
(line: String) => line.split(" ").size
Example: if line is "hello world", the function returns 2.

reduce also expects a function from (A, A) => A, where A is the type of your RDD. Let's call this function op. What does reduce do? Example: reduce a list of numbers with op being addition; the result here is the sum of the list elements, 10.
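A sketch of that reduce example (the concrete list is an assumption; any list of numbers summing to 10 fits the answer's result):

```scala
// reduce combines the elements pairwise with the given function op,
// here addition, collapsing the list to a single value.
val numbers = List(1, 2, 3, 4)
val sum = numbers.reduce((a, b) => a + b)
// sum == 10
```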
Remark: In general, reduce calculates op(op(op(x1, x2), x3), x4)..., folding the elements together pairwise. On an RDD, op should be associative and commutative, since Spark does not guarantee the order in which partial results are combined.

Full example: First, textFile is an RDD[String], say the lines of a file. map turns it into an RDD of per-line word counts, and reduce then picks the maximum of those counts — the word count of the longest line.
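The full chain can be sketched on a plain Scala List standing in for the RDD (the sample lines are assumptions); on a real cluster, textFile would come from sc.textFile(...):

```scala
// Stand-in for textFile: a List of lines instead of an RDD[String].
val lines = List("hello world", "a b c", "spark")

// Word count per line, then the maximum count, as in the question:
val maxWords = lines.map(line => line.split(" ").size)
                    .reduce((a, b) => if (a > b) a else b)
// maxWords == 3 (from "a b c")
```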
map and reduce are methods of the RDD class, which has an interface similar to the Scala collections.

What you pass to the methods map and reduce are actually anonymous functions (with one parameter in map, and with two parameters in reduce). map calls the provided function for every element (every line of text, in this context) that textFile has.

Maybe you should read an introduction to the Scala collections first.

You can read more about the RDD class API here: https://spark.apache.org/docs/1.2.1/api/scala/#org.apache.spark.rdd.RDD
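As a sketch (on a plain List, since the RDD interface mirrors it; the sample data is an assumption), a one-parameter anonymous function for map and a two-parameter one for reduce look like this:

```scala
// One-parameter anonymous function passed to map:
val sizes = List("a b", "c").map(s => s.split(" ").size)  // List(2, 1)

// Two-parameter anonymous function passed to reduce:
val total = sizes.reduce((a, b) => a + b)                 // 3
```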
What the map function does is: it takes a list of values and maps each of them through some function — similar to the map function in Python, if you are familiar with that.

Also, a file is like a list of Strings (not exactly, but that's how it's being iterated).

Let's say this is your file:
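For illustration, a hypothetical stand-in for the file, represented as a list of its lines (the contents are made up):

```scala
// Made-up file contents, one String per line:
val file = List("first line", "second line", "third line")
```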
Now let's see how the map function works. We need two things: the list of values, which we already have, and a function to which we want to map these values. Let's consider a really simple function for understanding: a function that takes a single String argument and prints it to the console. If we map this function over your list, it's going to print all the lines.
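A sketch of such a function (the name printLine and the sample lines are assumptions; the answer's original snippet was lost):

```scala
// Takes a single String argument and prints it to the console:
def printLine(arg: String): Unit = println(arg)

// Mapping it over the file's lines prints every line:
List("first line", "second line", "third line").map(printLine)
```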
We can also write this as an anonymous function, which does the same thing.
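For example (sample lines assumed, as above):

```scala
// The same mapping written with an anonymous function:
List("first line", "second line", "third line").map(arg => println(arg))
```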
In your case, line is the parameter of the anonymous function, bound to each line of the file in turn (starting with the first). You could change the argument name as you like; for example, in the example above, if I changed arg to line, it would work without any issue.