Matrix Operation in Spark MLlib in Java

2019-04-11 19:25发布

问题:

This question is about MLlib (Spark 1.2.1+).

What is the best way to manipulate local matrices (moderate size, under 100x100, so does not need to be distributed).

For instance, after computing the SVD of a dataset, I need to perform some matrix operation. The RowMatrix only provide a multiply function. The toBreeze method returns a DenseMatrix<Object> but the API does not seem Java friendly: public final <TT,B,That> That $plus(B b, UFunc.UImpl2<OpAdd$,TT,B,That> op)

In Spark+Java, how to do any of the following operations:

  • transpose a matrix
  • add/subtract two matrices
  • crop a Matrix
  • perform element-wise operations
  • etc

Javadoc RowMatrix: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html

RDD<Vector> data = ...;
RowMatrix matrix = new RowMatrix(data);
SingularValueDecomposition<RowMatrix, Matrix> svd = matrix.computeSVD(15, true, 1e-9d);

RowMatrix U = svd.U();
Vector s = svd.s();
Matrix V = svd.V();
//Example 1: How to compute transpose(U)*matrix
//Example 2: How to compute transpose(U(:,1:k))*matrix

EDIT: Thanks for dlwh for pointing me in the right direction, the following solution works:

import no.uib.cipr.matrix.DenseMatrix;
// ...
RowMatrix U = svd.U();
DenseMatrix U_mtj = new DenseMatrix((int) U.numCols(), (int) U.numRows(), U.toBreeze().toArray$mcD$sp(), true);
// From there, matrix operations are available on U_mtj

回答1:

Breeze just doesn't provide a Java-friendly API. (And, speaking as the main author, I have no plans to: it would hamstring the API too much.)

You can probably exploit the fact that MTJ uses the same dense matrix representation as we do. (Well, almost. Their API doesn't expose majorStride, but that shouldn't be an issue for you.)

That is, you can do something like this:

import no.uib.cipr.matrix.DenseMatrix;

// ...

breeze.linalg.DenseMatrix[Double] Ubreeze = U.toBreeze();
new DenseMatrix(Ubreeze.cols(), Ubreeze.rows(), Ubreeze.data());