I am trying to do matrix multiplication using Apache Spark and Python.
Here is my data
from pyspark.mllib.linalg.distributed import RowMatrix
My RDD of vectors
rows_1 = sc.parallelize([[1, 2], [4, 5], [7, 8]])
rows_2 = sc.parallelize([[1, 2], [4, 5]])
My maxtrix
mat1 = RowMatrix(rows_1)
mat2 = RowMatrix(rows_2)
I would like to do something like this:
mat = mat1 * mat2
I wrote a function to process the matrix multiplication but I'm afraid to have a long processing time. Here is my function:
def matrix_multiply(df1, df2):
nb_row = df1.count()
mat=[]
for i in range(0, nb_row):
row=list(df1.filter(df1['index']==i).take(1)[0])
row_out = []
for r in range(0, len(row)):
r_value = 0
col = df2.select(df2[list_col[r]]).collect()
col = [list(c)[0] for c in col]
for c in range(0, len(col)):
r_value += row[c] * col[c]
row_out.append(r_value)
mat.append(row_out)
return mat
My function make a lot of spark actions (take, collect, etc.). Does the function will take a lot of processing time? If someone have another idea it will be helpful for me.
You cannot. Since
RowMatrix
has no meaningful row indices it cannot be used for multiplications. Even ignoring that the only distributed matrix which supports multiplication with another distributed structure isBlockMatrix
.