I have a list of matrices L, where each item M is an x*n matrix (x is a variable, n is a constant). I want to compute the sum of M'*M for all items in L (M' is the transpose of M), as the following Python code does:
res = np.zeros((n, n))  # accumulator for the (n, n) result
for M in L:
    res += np.dot(M.T, M)
Actually I want to implement this in Theano (which doesn't support variable-length multidimensional arrays), and I don't want to pad all the matrices to the same size because that would waste too much space (some of the matrices can be very large).
Is there a better way to do this?
Edit: L is known before the Theano compilation.
Edit: I received two excellent answers from @DanielRenshaw and @Divakar; it's emotionally difficult to choose one to accept.
You can simply concatenate the input arrays along the first axis, which is where all the x's add up. Thus, we would end up with a tall (X, n) array, where X = x1 + x2 + x3 + .... This can be transposed, and its dot product with itself would be the desired output of shape (n, n). All of this is achieved with a pure vectorized solution leveraging the powerful dot product. Thus, the implementation would be -
# Concatenate along axis=0
Lcat = np.concatenate(L,axis=0)
# Perform dot product of the transposed version with self
out = Lcat.T.dot(Lcat)
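This is equivalent to the original loop because the dot product of the stacked array with itself decomposes block-wise: if C = [M1; M2; ...; Mk] is the row-wise stack, then C'*C = M1'*M1 + M2'*M2 + ... + Mk'*Mk, which is exactly the sum the question asks for.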
Runtime tests and output verification -
In [116]: def vectorized_approach(L):
     ...:     Lcat = np.concatenate(L, axis=0)
     ...:     return Lcat.T.dot(Lcat)
     ...:
     ...: def original_app(L):
     ...:     n = L[0].shape[1]
     ...:     res = np.zeros((n, n))
     ...:     for M in L:
     ...:         res += np.dot(M.T, M)
     ...:     return res
     ...:

In [117]: # Input
     ...: L = [np.random.rand(np.random.randint(1, 9), 5) for _ in range(1000)]

In [118]: np.allclose(vectorized_approach(L), original_app(L))
Out[118]: True

In [119]: %timeit original_app(L)
100 loops, best of 3: 3.84 ms per loop

In [120]: %timeit vectorized_approach(L)
1000 loops, best of 3: 632 µs per loop
Given that the number of matrices is known before the Theano compilation needs to happen, one can simply use regular Python lists of Theano matrices.
Here's a complete example showing the difference between numpy and Theano versions.
This code has been updated to include a comparison with @Divakar's vectorized approach, which can perform better. Two vectorized approaches are possible in Theano: one where Theano performs the concatenation, and one where numpy does the concatenation, with the result then passed to Theano.
import timeit
import numpy as np
import theano
import theano.tensor as tt


def compile_theano_version1(number_of_matrices, n, dtype):
    # Symbolic equivalent of the original loop: one dot product per matrix.
    assert number_of_matrices > 0
    assert n > 0
    L = [tt.matrix() for _ in range(number_of_matrices)]
    res = tt.zeros((n, n), dtype=dtype)
    for M in L:
        res += tt.dot(M.T, M)
    return theano.function(L, res)


def compile_theano_version2(number_of_matrices):
    # Vectorized approach; Theano performs the concatenation.
    assert number_of_matrices > 0
    L = [tt.matrix() for _ in range(number_of_matrices)]
    concatenated_L = tt.concatenate(L, axis=0)
    res = tt.dot(concatenated_L.T, concatenated_L)
    return theano.function(L, res)


def compile_theano_version3():
    # Vectorized approach; numpy concatenates, Theano only does the dot product.
    concatenated_L = tt.matrix()
    res = tt.dot(concatenated_L.T, concatenated_L)
    return theano.function([concatenated_L], res)


def numpy_version1(*L):
    # Reference implementation: the original accumulation loop.
    assert len(L) > 0
    n = L[0].shape[1]
    res = np.zeros((n, n), dtype=L[0].dtype)
    for M in L:
        res += np.dot(M.T, M)
    return res


def numpy_version2(*L):
    # @Divakar's vectorized approach in pure numpy.
    concatenated_L = np.concatenate(L, axis=0)
    return np.dot(concatenated_L.T, concatenated_L)


def main():
    iteration_count = 100
    number_of_matrices = 20
    n = 300
    min_x = 400
    dtype = 'float64'

    theano_version1 = compile_theano_version1(number_of_matrices, n, dtype)
    theano_version2 = compile_theano_version2(number_of_matrices)
    theano_version3 = compile_theano_version3()

    L = [np.random.standard_normal(size=(x, n)).astype(dtype)
         for x in range(min_x, number_of_matrices + min_x)]

    start = timeit.default_timer()
    numpy_res1 = np.sum(numpy_version1(*L)
                        for _ in range(iteration_count))
    print('numpy_version1', timeit.default_timer() - start)

    start = timeit.default_timer()
    numpy_res2 = np.sum(numpy_version2(*L)
                        for _ in range(iteration_count))
    print('numpy_version2', timeit.default_timer() - start)

    start = timeit.default_timer()
    theano_res1 = np.sum(theano_version1(*L)
                         for _ in range(iteration_count))
    print('theano_version1', timeit.default_timer() - start)

    start = timeit.default_timer()
    theano_res2 = np.sum(theano_version2(*L)
                         for _ in range(iteration_count))
    print('theano_version2', timeit.default_timer() - start)

    start = timeit.default_timer()
    theano_res3 = np.sum(theano_version3(np.concatenate(L, axis=0))
                         for _ in range(iteration_count))
    print('theano_version3', timeit.default_timer() - start)

    assert np.allclose(numpy_res1, numpy_res2)
    assert np.allclose(numpy_res2, theano_res1)
    assert np.allclose(theano_res1, theano_res2)
    assert np.allclose(theano_res2, theano_res3)


main()
When run, this prints something like:
numpy_version1 1.47830819649
numpy_version2 1.77405482179
theano_version1 1.3603150303
theano_version2 1.81665318145
theano_version3 1.86912039489
The asserts pass, showing that the Theano and numpy versions both compute the same result to a high degree of accuracy. Clearly, this accuracy will be reduced when using float32 instead of float64.
The timing results show that the vectorized approach is not necessarily preferable; it depends on the matrix sizes. In the example above the matrices are large, so the non-concatenation approach is faster, but if the n and min_x parameters in the main function are made much smaller, the concatenation approach is quicker (a sketch of such a configuration follows). Other results may hold when running on a GPU (Theano versions only).
In addition to @DanielRenshaw's answer: if we increase the number of matrices to 1000, the compile_theano_version1 function yields RuntimeError: maximum recursion depth exceeded, and compile_theano_version2 seems to take forever to compile. There's a fix for this: use typed_list:
def compile_theano_version4(number_of_matrices):
    import theano.typed_list
    # A typed list whose elements are float matrices of arbitrary shape
    L = theano.typed_list.TypedListType(
        tt.TensorType(theano.config.floatX, broadcastable=(None, None)))()
    # Scan over the list indices, computing one M.T.dot(M) per step
    res, _ = theano.scan(fn=lambda i: tt.dot(L[i].T, L[i]),
                         sequences=[tt.arange(number_of_matrices,
                                              dtype='int64')])
    return theano.function([L], res.sum(axis=0))
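A minimal usage sketch, assuming the names from @DanielRenshaw's script above (number_of_matrices and the list L built in main):

# Compile once, then pass the whole Python list as a single argument
theano_version4 = compile_theano_version4(number_of_matrices)
res4 = theano_version4(L)  # typed_list input: no *L unpacking needed

Note that, unlike theano_version1 and theano_version2, the compiled function takes the list itself as one argument rather than the matrices unpacked with *L.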
Moreover, I set the data type of all relevant variables to float32 and ran @DanielRenshaw's script on a GPU; it turned out that @Divakar's suggestion (theano_version3) is the most efficient in this case. Though, as @DanielRenshaw said, using one huge matrix may not always be good practice.
The following are the settings and outputs on my machine.
iteration_count = 100
number_of_matrices = 200
n = 300
min_x = 20
dtype = 'float32'
theano.config.floatX = dtype
numpy_version1 5.30542397499
numpy_version2 3.96656394005
theano_version1 5.26742005348
theano_version2 1.76983904839
theano_version3 1.03577589989
theano_version4 5.58366179466