I have a list of matrices L, where each item M is an x*n matrix (x is a variable, n is a constant). I want to compute the sum of M'*M for all items in L (M' is the transpose of M), as the following Python code does:
```python
res = np.zeros((n, n))
for M in L:
    res += np.dot(M.T, M)
```
Actually I want to implement this in Theano (which doesn't support variable length multidimensional arrays), and I don't want to pad all matrices to the same size because that will waste too much space (some of the matrices can be very large).
Is there a better way to do this?
Edit: L is known before the Theano compilation.
Edit: I received two excellent answers from @DanielRenshaw and @Divakar; it was emotionally difficult to choose one to accept.
Given that the number of matrices is known before the Theano compilation needs to happen, one can simply use regular Python lists of Theano matrices.
Here's a complete example showing the difference between numpy and Theano versions.
This code has been updated to include a comparison with @Divakar's vectorized approach which performs better. Two vectorized approaches are possible for Theano, one where Theano performs the concatenation, and one where numpy does the concatenation the result of which is then passed to Theano.
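The full benchmark script is not reproduced here. Stripped of the Theano plumbing, the two strategies being compared amount to the following (a plain-numpy sketch; the function names are illustrative, not the ones from the original script):

```python
import numpy as np

def version1_loop(L, n):
    # Non-concatenation approach: one dot product per matrix.
    res = np.zeros((n, n))
    for M in L:
        res += np.dot(M.T, M)
    return res

def version2_concat(L):
    # Concatenation approach: stack all matrices into one tall
    # (x1 + x2 + ..., n) array and take a single dot product.
    X = np.concatenate(L, axis=0)
    return np.dot(X.T, X)
```

Both return the same (n, n) result; which one is faster depends on the matrix sizes.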
When run, this prints (something like):
The asserts pass, showing that the Theano and numpy versions both compute the same result to a high degree of accuracy. Clearly this accuracy will be reduced if float32 is used instead of float64.

The timing results show that the vectorized approach may not be preferable; it depends on the matrix sizes. In the example above the matrices are large and the non-concatenation approach is faster, but if the n and min_x parameters in the main function are changed to be much smaller, then the concatenation approach is quicker. Other results may hold when running on a GPU (Theano versions only).

In addition to @DanielRenshaw's answer: if we increase the number of matrices to 1000, the compile_theano_version1 function yields RuntimeError: maximum recursion depth exceeded, and compile_theano_version2 seems to take forever to compile.

There's a fix for this, using typed_list:
Moreover, I set the data type of all relevant variables to float32 and ran @DanielRenshaw's script on a GPU. It turned out that @Divakar's suggestion (theano_version3) is the most efficient in this case, though, as @DanielRenshaw said, using a huge matrix may not always be good practice.

The following are the settings and outputs on my machine.
You can just stack the input arrays along the first axis, so that all the x's add up. Thus, we would end up with a tall (X, n) array, where X = x1 + x2 + x3 + .... This can be transposed, and its dot product with itself would be the desired output of shape (n, n). All of this is achieved with a pure vectorized solution leveraging the powerful dot product. Thus, the implementation would be -

Runtime tests and verify output -
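The original code block is not shown above; a minimal numpy sketch of the described approach (the function name is illustrative) would be:

```python
import numpy as np

def vectorized_sum(L):
    # Stack all matrices into one tall (X, n) array, X = x1 + x2 + ...,
    # then a single dot product yields the (n, n) result.
    X = np.vstack(L)
    return np.dot(X.T, X)
```

As a quick correctness check, the result can be compared against the straightforward loop, sum(np.dot(M.T, M) for M in L).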