The following code is written in Python and it works, i.e. it returns the expected result. However, it is very slow and I believe it can be optimized.
import math
import numpy
import numpy.matlib

G_tensor = numpy.matlib.identity(N_particles * 3, dtype=complex)
for i in range(N_particles):
    for j in range(i, N_particles):
        if i != j:
            # Do lots of things; here is shown an example.
            # However, you should not be scared because
            # it only fills G_tensor.
            R = numpy.linalg.norm(numpy.array(positions[i]) - numpy.array(positions[j]))
            rx = positions[i][0] - positions[j][0]
            ry = positions[i][1] - positions[j][1]
            rz = positions[i][2] - positions[j][2]
            krq = (k * R)**2
            pf = -k**2 * alpha * numpy.exp(1j * k * R) / (4 * math.pi * R)
            a = 1. + (1j * k * R - 1.) / krq
            b = (3. - 3. * 1j * k * R - krq) / krq
            G_tensor[3*i + 0, 3*j + 0] = pf * (a + b * (rx * rx) / R**2)  # Gxx
            G_tensor[3*i + 1, 3*j + 1] = pf * (a + b * (ry * ry) / R**2)  # Gyy
            G_tensor[3*i + 2, 3*j + 2] = pf * (a + b * (rz * rz) / R**2)  # Gzz
            G_tensor[3*i + 0, 3*j + 1] = pf * (b * (rx * ry) / R**2)      # Gxy
            G_tensor[3*i + 0, 3*j + 2] = pf * (b * (rx * rz) / R**2)      # Gxz
            G_tensor[3*i + 1, 3*j + 0] = pf * (b * (ry * rx) / R**2)      # Gyx
            G_tensor[3*i + 1, 3*j + 2] = pf * (b * (ry * rz) / R**2)      # Gyz
            G_tensor[3*i + 2, 3*j + 0] = pf * (b * (rz * rx) / R**2)      # Gzx
            G_tensor[3*i + 2, 3*j + 1] = pf * (b * (rz * ry) / R**2)      # Gzy
            G_tensor[3*j + 0, 3*i + 0] = pf * (a + b * (rx * rx) / R**2)  # Gxx
            G_tensor[3*j + 1, 3*i + 1] = pf * (a + b * (ry * ry) / R**2)  # Gyy
            G_tensor[3*j + 2, 3*i + 2] = pf * (a + b * (rz * rz) / R**2)  # Gzz
            G_tensor[3*j + 0, 3*i + 1] = pf * (b * (rx * ry) / R**2)      # Gxy
            G_tensor[3*j + 0, 3*i + 2] = pf * (b * (rx * rz) / R**2)      # Gxz
            G_tensor[3*j + 1, 3*i + 0] = pf * (b * (ry * rx) / R**2)      # Gyx
            G_tensor[3*j + 1, 3*i + 2] = pf * (b * (ry * rz) / R**2)      # Gyz
            G_tensor[3*j + 2, 3*i + 0] = pf * (b * (rz * rx) / R**2)      # Gzx
            G_tensor[3*j + 2, 3*i + 1] = pf * (b * (rz * ry) / R**2)      # Gzy
Do you know how I can parallelize it? Note that the two loops are not symmetric.
Edit one: A numpythonic solution was presented in an answer, and I compared the C++ implementation, my Python loop version, and the numpythonic one. The results are:
- C++: 0.14 s
- numpythonic version: 1.39 s
- Python loop version: 46.56 s
Results could probably improve further with the Intel build of NumPy.
Here is a proposition that should now work (I corrected a few mistakes) and that should give you the general idea of how vectorization can be applied to your code in order to make efficient use of NumPy arrays. Everything is built in "one pass" (i.e. without any for-loops), which is the "numpythonic" way:
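The answer's own code is not reproduced here, but a one-pass construction of the same tensor might look like the sketch below. It assumes positions has been converted to an (N, 3) NumPy array, that k and alpha are scalars, and the function name build_g_tensor is mine; the block structure mirrors the loop version (identity on the diagonal 3x3 blocks, pf*(a*I + b*r_u r_v/R^2) off-diagonal).

```python
import numpy as np

def build_g_tensor(positions, k, alpha):
    """One-pass (loop-free) sketch of the G_tensor assembly.

    positions : (N, 3) array of particle coordinates (assumed layout).
    k, alpha  : scalar parameters, as in the loop version.
    """
    N = positions.shape[0]
    r = positions[:, None, :] - positions[None, :, :]   # (N, N, 3) pairwise differences
    R = np.linalg.norm(r, axis=-1)                      # (N, N) pairwise distances
    np.fill_diagonal(R, 1.0)                            # dummy value: avoids 0/0 on the diagonal
    kR = k * R
    krq = kR**2
    pf = -k**2 * alpha * np.exp(1j * kR) / (4 * np.pi * R)
    a = 1.0 + (1j * kR - 1.0) / krq
    b = (3.0 - 3.0 * 1j * kR - krq) / krq
    # outer products r_u * r_v / R^2 for every pair: shape (N, N, 3, 3)
    rr = r[..., :, None] * r[..., None, :] / R[..., None, None]**2
    # 3x3 block for every (i, j) pair
    G = pf[..., None, None] * (a[..., None, None] * np.eye(3) + b[..., None, None] * rr)
    # the loop version leaves the diagonal blocks as identity
    G[np.arange(N), np.arange(N)] = np.eye(3)
    # fold the (N, N, 3, 3) block structure into a (3N, 3N) matrix
    return G.transpose(0, 2, 1, 3).reshape(3 * N, 3 * N)
```

The (N, N, 3, 3) intermediate uses O(N^2) memory, so for very large N a blocked variant may be needed.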
Python is not a fast language. For number crunching, the time-critical parts should be written in a compiled language. Compiling down to the CPU level can speed up the code by a factor of up to 100, and you can still parallelize on top of that. So I would not look at throwing more cores at inefficient code, but at working more efficiently. I see the following ways to speed up the code:
1) Better use of NumPy: can you do your calculations on the vector/matrix level instead of on the scalar level? E.g. rx = positions[:, 0][:, None] - positions[None, :, 0], or something along those lines.
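To make the broadcasting idea concrete, here is a tiny sketch on toy data (the positions array and its values are hypothetical), computing all pairwise coordinate differences and distances at once:

```python
import numpy as np

# toy (N, 3) position array, purely for illustration
positions = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 2.0, 0.0]])

# broadcasting a column (N, 1) against a row (1, N) gives all pairs:
# rx[i, j] == positions[i, 0] - positions[j, 0]
rx = positions[:, 0][:, None] - positions[None, :, 0]

# all pairwise distances in one expression, shape (N, N)
R = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
```

Every scalar quantity in the loop (rx, ry, rz, R, pf, a, b) can be turned into such an (N, N) array, after which the 3x3 blocks follow element-wise.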
If that is not possible with your kind of calculations, then you can go for option 2 or 3.
2) Use Cython. Cython compiles Python code to C, which is then compiled for your CPU. By using static typing in the right places you can make your code much faster; see the Cython tutorials, e.g.: http://cython.readthedocs.io/en/latest/src/quickstart/cythonize.html
3) If you are familiar with Fortran, it might be a good idea to write just this part in Fortran and call it from Python using f2py. In fact, your code looks a lot like Fortran anyway. For C and C++, SWIG is one great tool for making compiled code available in Python, but there are plenty of other techniques (Cython, Boost::Python, ctypes, Numba, etc.).
When you have done this and it is still too slow, using GPU power with PyCUDA, or parallelization with mpi4py or multiprocessing, might be an option.
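As a minimal sketch of the parallelization route: the outer i-loop splits naturally into independent row tasks that a pool can distribute. The example below uses a toy pairwise-distance computation (not the full tensor) and a ThreadPool so the sketch runs anywhere; multiprocessing.Pool offers the same map interface for true process parallelism, with the usual picklability and __main__-guard caveats.

```python
import numpy as np
from multiprocessing.pool import ThreadPool  # same API as multiprocessing.Pool

# toy data for illustration
positions = np.array([[0.0, 0.0, 0.0],
                      [3.0, 4.0, 0.0],
                      [0.0, 0.0, 1.0]])

def distance_row(i):
    # distances from particle i to every particle: one independent task per row
    return np.linalg.norm(positions - positions[i], axis=1)

# distribute the rows of the outer loop over the pool
with ThreadPool(processes=2) as pool:
    D = np.vstack(pool.map(distance_row, range(len(positions))))
```

Note that since the rows are independent, the asymmetry of the two loops is not an obstacle here: each worker simply computes a full row.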