Spark Matrix multiplication with python

2020-02-29 03:52发布

I am trying to do matrix multiplication using Apache Spark and Python.

Here is my data

from pyspark.mllib.linalg.distributed import RowMatrix

My RDD of vectors

rows_1 = sc.parallelize([[1, 2], [4, 5], [7, 8]])
rows_2 = sc.parallelize([[1, 2], [4, 5]])

My maxtrix

mat1 = RowMatrix(rows_1)
mat2 = RowMatrix(rows_2)

I would like to do something like this:

mat = mat1 * mat2

I wrote a function to process the matrix multiplication but I'm afraid to have a long processing time. Here is my function:

def matrix_multiply(df1, df2):
    nb_row = df1.count()    
    mat=[]
    for i in range(0, nb_row):
        row=list(df1.filter(df1['index']==i).take(1)[0])
        row_out = []
        for r in range(0, len(row)):
            r_value = 0
            col = df2.select(df2[list_col[r]]).collect()
            col = [list(c)[0] for c in col]
            for c in range(0, len(col)): 
                r_value += row[c] * col[c]
            row_out.append(r_value)            
        mat.append(row_out)
    return mat

My function make a lot of spark actions (take, collect, etc.). Does the function will take a lot of processing time? If someone have another idea it will be helpful for me.

标签： apache-spark pyspark apache-spark-mllib

1条回答

Lonely孤独者°

2楼-- · 2020-02-29 04:08

You cannot. Since RowMatrix has no meaningful row indices it cannot be used for multiplications. Even ignoring that the only distributed matrix which supports multiplication with another distributed structure is BlockMatrix.

from pyspark.mllib.linalg.distributed import *

def as_block_matrix(rdd, rowsPerBlock=1024, colsPerBlock=1024):
    return IndexedRowMatrix(
        rdd.zipWithIndex().map(lambda xi: IndexedRow(xi[1], xi[0]))
    ).toBlockMatrix(rowsPerBlock, colsPerBlock)

as_block_matrix(rows_1).multiply(as_block_matrix(rows_2))

0人赞添加讨论(0) 举报

Spark Matrix multiplication with python

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间