Let's say I create a Theano function. How do I run element-wise operations in parallel on Theano tensors such as matrices?
# This is inside a Theano function. Instead of a for loop, I'd like to run this in parallel.
import numpy as np

c = np.zeros((2, 200))
for n in range(20):
    # The loop body is arbitrary; the exact values don't matter.
    c[0][n] = n % 20
    c[1][n] = n // 20  # integer division
# In CUDA, we would normally use an if statement:
# if (threadIdx.x == some_index) { c[0][n] = some_value; }
To rephrase the question: how do I perform parallel operations in a Theano function? I've looked at http://deeplearning.net/software/theano/tutorial/multi_cores.html#parallel-element-wise-ops-with-openmp, which only talks about adding a setting and does not explain how element-wise operations are actually parallelized.
To an extent, Theano expects you to focus more on what you want computed rather than on how you want it computed. The idea is that the Theano optimizing compiler will automatically parallelize as much as possible (either on GPU or on CPU using OpenMP).
The original post's example can be expressed in this style. The difference is that the computation is declared symbolically and, crucially, without any loops: one tells Theano that the result should be a stack of two tensors, where the first is the values of a range modulo the range size and the second is the elements of the same range divided by the range size. We never say that a loop should occur, though clearly at least one is required. Theano compiles this down to executable code and parallelizes it where that makes sense.
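A minimal sketch of that symbolic formulation might look like the code below. It assumes `theano.tensor` is imported as `tt` and uses `tt.arange`, `tt.stack`, and `theano.function`; the helper names `symbolic_range_mod_div` and `compile_range_mod_div` are made up for illustration. Because Theano may not be installed everywhere, the sketch falls back to an equivalent NumPy expression (an assumption added here, and not symbolic) so that it still runs.

```python
import numpy as np

try:
    import theano
    import theano.tensor as tt
except Exception:
    # Assumption: Theano is unavailable, so use the NumPy stand-in below.
    theano = None


def symbolic_range_mod_div(size):
    """Declare a (2, size) result: row 0 is r % size, row 1 is r // size."""
    r = tt.arange(size)
    # No loop is written here; Theano decides how to execute it.
    return tt.stack([r % size, r // size])


def compile_range_mod_div():
    """Return a callable that maps an integer size to the stacked result."""
    if theano is None:
        # NumPy stand-in with the same loop-free structure (illustration only).
        return lambda n: np.stack([np.arange(n) % n, np.arange(n) // n])
    size = tt.lscalar("size")
    return theano.function(inputs=[size], outputs=symbolic_range_mod_div(size))


range_mod_div = compile_range_mod_div()
result = range_mod_div(20)
print(result)
```

The compiled function returns an array of shape `(2, size)`. Whether the element-wise operations actually run in parallel depends on the device and on the Theano configuration, not on anything written in the function itself.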
You need to be able to specify your computation in terms of Theano operations. If those operations can be parallelized on the GPU, they should be parallelized automatically.
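The CUDA-style per-thread `if` from the question can be phrased the same way, as an element-wise select rather than a branch. The following is only a sketch under assumptions: it uses `tt.switch` and `tt.eq` from `theano.tensor`, the names `row`, `some_index`, `some_value`, and `compile_masked_assign` are hypothetical, and a NumPy fallback based on `np.where` is included purely so the example runs without Theano.

```python
import numpy as np

try:
    import theano
    import theano.tensor as tt
except Exception:
    # Assumption: Theano is unavailable, so use the NumPy stand-in below.
    theano = None


def compile_masked_assign():
    """Return f(row, some_index, some_value) that replaces row[some_index]."""
    if theano is None:
        # NumPy stand-in: np.where plays the role of tt.switch.
        return lambda row, i, v: np.where(np.arange(row.shape[0]) == i, v, row)
    row = tt.dvector("row")
    some_index = tt.lscalar("some_index")
    some_value = tt.dscalar("some_value")
    positions = tt.arange(row.shape[0])
    # Every element evaluates the same condition, much as each CUDA thread would.
    out = tt.switch(tt.eq(positions, some_index), some_value, row)
    return theano.function([row, some_index, some_value], out)


masked_assign = compile_masked_assign()
updated = masked_assign(np.zeros(5), 2, 7.0)
print(updated)
```

Writing the condition as a tensor expression keeps the whole computation inside the graph, which leaves Theano free to pick the execution strategy.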