I need to add two matrices that are stored in two files.
The files latest1.txt and latest2.txt both have the following content:
1 2 3 4 5 6 7 8 9
I am reading those files as follows:
scala> import org.apache.spark.mllib.linalg.Vectors
scala> val rows = sc.textFile("latest1.txt").map { line => val values = line.split(' ').map(_.toDouble)
Vectors.sparse(values.length, values.zipWithIndex.map(e => (e._2, e._1)).filter(_._2 != 0.0))
}
scala> val r1 = rows
r1: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[2] at map at :14
scala> val rows = sc.textFile("latest2.txt").map { line => val values = line.split(' ').map(_.toDouble)
Vectors.sparse(values.length, values.zipWithIndex.map(e => (e._2, e._1)).filter(_._2 != 0.0))
}
scala> val r2 = rows
r2: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[2] at map at :14
I want to add r1 and r2. Is there any way to add these two RDD[mllib.linalg.Vector]s in Apache Spark?
The following code exposes the asBreeze and fromBreeze methods from Spark. This solution supports SparseVector, in contrast to using vector.toArray. Note that Spark may change its API in the future and has already renamed toBreeze to asBreeze.
With this you can do:
df.withColumn("xy", addVectors($"x", $"y"))
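A minimal sketch of what such a helper could look like, assuming Spark 2.x and the `org.apache.spark.ml.linalg` vector types used by DataFrames. The object name `VectorUtils` and the `add` helper are hypothetical. The file has to declare Spark's own package because `asBreeze` and `Vectors.fromBreeze` are `private[spark]`, which is also why it may break on a future release.

```scala
// Declared inside Spark's package so the private[spark] members are visible.
// Assumption: Spark 2.x with the ml.linalg vector types.
package org.apache.spark.ml.linalg

import breeze.linalg.{Vector => BV}
import org.apache.spark.sql.functions.udf

// Hypothetical helper object; the name is illustrative only.
object VectorUtils {
  // Re-export the package-private conversions as public methods.
  def fromBreeze(bv: BV[Double]): Vector = Vectors.fromBreeze(bv)
  def asBreeze(v: Vector): BV[Double] = v.asBreeze

  // Element-wise addition without forcing a dense copy via toArray.
  def add(a: Vector, b: Vector): Vector = fromBreeze(asBreeze(a) + asBreeze(b))

  // UDF wrapper so the addition can be used on DataFrame columns.
  val addVectors = udf { (a: Vector, b: Vector) => add(a, b) }
}
```

Because the addition happens on the Breeze side, two sparse inputs do not have to be expanded into full arrays first.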
This is actually a good question. I work with MLlib regularly and did not realize that these basic linear algebra operations are not easily accessible.
The point is that the underlying Breeze vectors have all of the linear algebra manipulations you would expect, including of course the basic element-wise addition that you specifically mentioned.
However, the Breeze implementation is hidden from the outside world via:
So then, from the outside world/public API perspective, how do we access those primitives?
Some of them are already exposed, e.g. sum of squares:
However, the selection of such available methods is limited, and in fact it does not include the basic operations such as element-wise addition, subtraction, multiplication, etc.
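For illustration, here are two operations that are public on `org.apache.spark.mllib.linalg.Vectors`. Treating `Vectors.sqdist` as the sum-of-squares method meant above is an assumption; the object name `ExposedOps` is hypothetical.

```scala
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical wrapper object, only here to make the snippet self-contained.
object ExposedOps {
  val v1 = Vectors.dense(1.0, 2.0, 3.0)
  val v2 = Vectors.dense(4.0, 6.0, 3.0)

  // Squared Euclidean distance: (1-4)^2 + (2-6)^2 + (3-3)^2 = 25.0
  val squaredDistance: Double = Vectors.sqdist(v1, v2)

  // L2 norm of (3, 4): sqrt(9 + 16) = 5.0
  val l2Norm: Double = Vectors.norm(Vectors.dense(3.0, 4.0), 2.0)
}
```

Both calls return a scalar; neither gives you a new vector, which is why they do not help with element-wise addition.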
So here is the best I could see:
Here is some sample code:
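A minimal sketch of that workaround, using only public API: copy each MLlib vector into a Breeze `DenseVector` through `toArray`, add on the Breeze side, and wrap the result back up. The names `AddRows`, `addVectors` and `addRdds` are hypothetical. The RDD version assumes that `r1` and `r2` have the same number of partitions and the same number of elements per partition, which `RDD.zip` requires; that should hold when two files with identical content are read the same way, but it is not guaranteed in general.

```scala
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical helper object; names are illustrative only.
object AddRows {
  // Element-wise sum of two MLlib vectors via Breeze.
  // toArray densifies sparse vectors, so the result is always dense.
  def addVectors(a: Vector, b: Vector): Vector = {
    val sum = new BDV(a.toArray) + new BDV(b.toArray)
    Vectors.dense(sum.toArray)
  }

  // Row-by-row sum of two RDDs of vectors.
  // Assumes both RDDs are partitioned identically (a requirement of zip).
  def addRdds(r1: RDD[Vector], r2: RDD[Vector]): RDD[Vector] =
    r1.zip(r2).map { case (a, b) => addVectors(a, b) }
}
```

With the RDDs from the question this would be called as `AddRows.addRdds(r1, r2)`. The cost of this approach is the dense copy of every row, so for very sparse data the package-private Breeze conversions are the more economical route.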