I'm relatively new to Python and I've got a nested for loop. Since the for loops take a while to run, I'm trying to figure out a way to vectorize this code so it can run faster.
In this case, coord is a 3-dimensional array where coord[x, 0, 0] and coord[x, 0, 1] are integers and coord[x, 0, 2] is either 0 or 1. H is a SciPy sparse matrix and x_dist, y_dist, z_dist, and a are all floats.
import numpy as np
from scipy import sparse

# x_dist, y_dist, and z_dist are floats
# coord is a num x 1 x 3 numpy array where num can go into the hundreds of thousands
num = coord.shape[0]
H = sparse.lil_matrix((num, num))
for i in xrange(num):
    for j in xrange(num):
        # Only look at pairs whose integer indices are close enough
        if (np.absolute(coord[i, 0, 0] - coord[j, 0, 0]) <= 2 and
                (np.absolute(coord[i, 0, 1] - coord[j, 0, 1]) <= 1)):
            # Physical x and y separation of the pair
            x = ((coord[i, 0, 0] * x_dist + coord[i, 0, 2] * z_dist) -
                 (coord[j, 0, 0] * x_dist + coord[j, 0, 2] * z_dist))
            y = (coord[i, 0, 1] * y_dist) - (coord[j, 0, 1] * y_dist)
            # Keep the pair only if its distance is within 0.5 of a
            if a - 0.5 <= np.sqrt(x ** 2 + y ** 2) <= a + 0.5:
                H[i, j] = -2.7
I've also read that broadcasting with NumPy, while much faster, can lead to large amounts of memory usage from temporary arrays. Would it be better to go the vectorization route or to use something like Cython?
This is how I would vectorize your code; some discussion of the caveats follows below:
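Something along these lines (a sketch only, assuming coord, x_dist, y_dist, z_dist, and a are defined as in your question; idx is the boolean mask of candidate pairs discussed below, while the xi/yi/zi and px/py names are just illustrative):

import numpy as np
from scipy import sparse

num = coord.shape[0]
xi = coord[:, 0, 0]   # integer x index
yi = coord[:, 0, 1]   # integer y index
zi = coord[:, 0, 2]   # either 0 or 1

# Boolean mask of candidate pairs: shape (num, num)
idx = ((np.abs(xi[:, None] - xi[None, :]) <= 2) &
       (np.abs(yi[:, None] - yi[None, :]) <= 1))

# Physical positions of every point
px = xi * x_dist + zi * z_dist
py = yi * y_dist

# Distances, evaluated only for the candidate pairs
i, j = np.nonzero(idx)
d = np.sqrt((px[i] - px[j]) ** 2 + (py[i] - py[j]) ** 2)
keep = (d >= a - 0.5) & (d <= a + 0.5)

H = sparse.coo_matrix((np.full(keep.sum(), -2.7), (i[keep], j[keep])),
                      shape=(num, num)).tolil()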
As you noted, the issues are going to come with the idx array, as it will be of shape (num, num), so it will probably blow your memory to pieces if num is "into the hundreds of thousands."

One potential solution is to break your problem down into manageable chunks. If you have a 100,000-element array, you can split it into 100 chunks of 1,000 elements and run a modified version of the code above for each of the 10,000 combinations of chunks. You would only need a 1,000,000-element idx array (which you could pre-allocate and reuse for better performance), and you would have a loop of only 10,000 iterations instead of the 10,000,000,000 of your current implementation. It is sort of a poor man's parallelization scheme, which you can actually improve on by having several of those chunks processed in parallel if you have a multi-core machine.
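A sketch of what that chunked version might look like (the chunk size and the helper name build_H_chunked are just for illustration):

import numpy as np
from scipy import sparse

def build_H_chunked(coord, x_dist, y_dist, z_dist, a, chunk=1000):
    # Build H block by block so no temporary array is larger than chunk x chunk.
    num = coord.shape[0]
    xi, yi, zi = coord[:, 0, 0], coord[:, 0, 1], coord[:, 0, 2]
    px = xi * x_dist + zi * z_dist   # physical x positions
    py = yi * y_dist                 # physical y positions

    H = sparse.lil_matrix((num, num))
    for i0 in range(0, num, chunk):
        i1 = min(i0 + chunk, num)
        for j0 in range(0, num, chunk):
            j1 = min(j0 + chunk, num)
            # Candidate mask for this block only: at most chunk x chunk elements
            idx = ((np.abs(xi[i0:i1, None] - xi[None, j0:j1]) <= 2) &
                   (np.abs(yi[i0:i1, None] - yi[None, j0:j1]) <= 1))
            i, j = np.nonzero(idx)
            d = np.sqrt((px[i0 + i] - px[j0 + j]) ** 2 +
                        (py[i0 + i] - py[j0 + j]) ** 2)
            keep = (d >= a - 0.5) & (d <= a + 0.5)
            if keep.any():
                H[i0 + i[keep], j0 + j[keep]] = -2.7
    return H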
The nature of the calculation makes it tough to vectorize further with the numpy methods I'm familiar with. I think the best solution in terms of speed and memory usage would be Cython. However, you can get some speed-up using numba. Here is an example (note that normally you use autojit as a decorator):
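Something like the following (a sketch only; the names get_H and get_H_numba are illustrative, and recent numba releases expose this through numba.jit rather than the older autojit):

import numpy as np
from numba import jit  # older numba releases provided autojit for the same purpose

def get_H(coord, x_dist, y_dist, z_dist, a):
    # Same double loop as the original code, but writing into a dense array
    # so that numba can compile it; convert to a sparse matrix afterwards.
    num = coord.shape[0]
    H = np.zeros((num, num))
    for i in range(num):
        for j in range(num):
            if (abs(coord[i, 0, 0] - coord[j, 0, 0]) <= 2 and
                    abs(coord[i, 0, 1] - coord[j, 0, 1]) <= 1):
                x = ((coord[i, 0, 0] * x_dist + coord[i, 0, 2] * z_dist) -
                     (coord[j, 0, 0] * x_dist + coord[j, 0, 2] * z_dist))
                y = coord[i, 0, 1] * y_dist - coord[j, 0, 1] * y_dist
                r = (x * x + y * y) ** 0.5
                if r >= a - 0.5 and r <= a + 0.5:
                    H[i, j] = -2.7
    return H

get_H_numba = jit(nopython=True)(get_H)  # compiled version, same call signature

The dense H here just keeps the example compilable in nopython mode, so it is only reasonable for moderate num; for the full problem you would combine it with the chunking idea above, or collect the (i, j) index pairs instead and build the sparse matrix from those.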
I get this output: