I'm trying to parallelize a simple update loop of a simulation on the GPU. Basically, there are a bunch of "creatures" represented by circles; in each update loop they move, and then there is a check of whether any of them intersect. radii holds the radius for each creature type.
import numpy as np
import math
from numba import cuda
@cuda.jit('void(float32[:], float32[:], float32[:], uint8[:], float32[:], float32[:], float32, uint32, uint32)')
def update(p_x, p_y, radii, types, velocities, max_velocities, acceleration, num_creatures, cycles):
    for c in range(cycles):
        for i in range(num_creatures):
            velocities[i] = velocities[i] + acceleration
            if velocities[i] > max_velocities[i]:
                velocities[i] = max_velocities[i]
            p_x[i] = p_x[i] + (math.cos(1.0) * velocities[i])
            p_y[i] = p_y[i] + (math.sin(1.0) * velocities[i])
        for i in range(num_creatures):
            for j in range(i, num_creatures):
                delta_x = p_x[j] - p_x[i]
                delta_y = p_y[j] - p_y[i]
                distance_squared = (delta_x * delta_x) + (delta_y * delta_y)
                sum_of_radii = radii[types[i]] + radii[types[i]]
                if distance_squared < sum_of_radii * sum_of_radii:
                    pass
acceleration = .1
creature_radius = 10
spacing = 20
food_radius = 3
max_num_creatures = 1500
num_creatures = 0
max_num_food = 500
num_food = 0
max_num_entities = max_num_creatures + max_num_food
num_entities = 0
cycles = 1
p_x = np.zeros(max_num_entities, dtype=np.float32)
p_y = np.zeros(max_num_entities, dtype=np.float32)
radii = np.array([creature_radius, creature_radius, food_radius], dtype=np.float32)
types = np.zeros(max_num_entities, dtype=np.uint8)
velocities = np.zeros(max_num_creatures, dtype=np.float32)
max_velocities = np.zeros(max_num_creatures, dtype=np.float32)
# types:
# male - 0
# female - 1
# food - 2
for x in range(1, 800 // spacing):
    for y in range(1, 600 // spacing):
        if num_creatures % 2 == 0:
            types[num_creatures] = 0
        else:
            types[num_creatures] = 1
        p_x[num_creatures] = x * spacing
        p_y[num_creatures] = y * spacing
        max_velocities[num_creatures] = 5
        num_creatures += 1
device_p_x = cuda.to_device(p_x)
device_p_y = cuda.to_device(p_y)
device_radii = cuda.to_device(radii)
device_types = cuda.to_device(types)
device_velocities = cuda.to_device(velocities)
device_max_velocities = cuda.to_device(max_velocities)
threadsperblock = 64
blockspergrid = 16
update[blockspergrid, threadsperblock](device_p_x, device_p_y, device_radii, device_types, device_velocities, device_max_velocities,
                                       acceleration, num_creatures, cycles)
print(device_p_x.copy_to_host())
The 1.0 in math.cos and math.sin is just a placeholder for the directions of the individual creatures.
As it is now there are multiple threads, but they execute the same code. What changes do I have to make to the kernel to parallelize it?
The most obvious dimension for parallelization, to me, seems to be the loop in i in your kernel, the one iterating over num_creatures. So I'll describe how to do that.

Our goal will be to remove the loop over num_creatures and instead let each iteration of that loop be handled by a separate CUDA thread. This is possible because the work done in each loop iteration is (mostly) independent: it does not depend on the results of other loop iterations (but see the caveat about the second loop below).

A challenge we will run into is that the second for-loop over num_creatures presumably depends on the results of the first. If we leave everything as serial code running in a single thread, that dependency is taken care of by the nature of serial code execution. However, we want to parallelize this, so we need a global sync between the first for-loop over num_creatures and the second. A simple, convenient global sync in CUDA is the kernel launch, so we'll break the kernel code into two kernel functions. We'll call these update1 and update2.

This then presents the challenge of what to do about the over-arching loop in cycles. We can't simply replicate that loop in both kernels, because that would change the functional behavior: we would compute cycles updates to p_x before computing a single calculation of delta_x, for example. That is presumably not what is wanted. So, for simplicity, we'll hoist this loop out of the kernel code and back into host code. The host code will then call the update1 and update2 kernels for cycles iterations.

We also want to make the kernel processing adaptable to different sizes of num_creatures. So we'll pick a hard-coded size for threadsperblock, but make the number of blocks launched variable, based on the size of num_creatures. To facilitate this, we need a thread-check (an initial if-statement) in each of our kernels, so that "extra" threads don't do anything.

With that description, we end up with something like this:
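The full t11.py listing from the original answer isn't reproduced in this excerpt, so here is a minimal sketch of the arrangement just described (two kernels, a thread-check, and the cycles loop moved to host code). It assumes the arrays and constants set up in the question; the explicit signatures and the blocks-per-grid arithmetic are illustrative choices, and the exact answer code may differ in detail.

@cuda.jit('void(float32[:], float32[:], float32[:], float32[:], float32, uint32)')
def update1(p_x, p_y, velocities, max_velocities, acceleration, num_creatures):
    # one thread per creature; the thread-check keeps "extra" threads idle
    i = cuda.grid(1)
    if i < num_creatures:
        velocities[i] = velocities[i] + acceleration
        if velocities[i] > max_velocities[i]:
            velocities[i] = max_velocities[i]
        # 1.0 is still the direction placeholder from the question
        p_x[i] = p_x[i] + (math.cos(1.0) * velocities[i])
        p_y[i] = p_y[i] + (math.sin(1.0) * velocities[i])

@cuda.jit('void(float32[:], float32[:], float32[:], uint8[:], uint32)')
def update2(p_x, p_y, radii, types, num_creatures):
    i = cuda.grid(1)
    if i < num_creatures:
        for j in range(i, num_creatures):
            delta_x = p_x[j] - p_x[i]
            delta_y = p_y[j] - p_y[i]
            distance_squared = (delta_x * delta_x) + (delta_y * delta_y)
            # kept as in the question; possibly radii[types[j]] was intended for the second term
            sum_of_radii = radii[types[i]] + radii[types[i]]
            if distance_squared < sum_of_radii * sum_of_radii:
                pass

# host code: the cycles loop lives here now; each kernel launch provides the global sync
threadsperblock = 64
blockspergrid = (num_creatures // threadsperblock) + 1
for c in range(cycles):
    update1[blockspergrid, threadsperblock](device_p_x, device_p_y, device_velocities,
                                            device_max_velocities, acceleration, num_creatures)
    update2[blockspergrid, threadsperblock](device_p_x, device_p_y, device_radii,
                                            device_types, num_creatures)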
which produces the same output as the original posted version. (The original code is running 16 blocks of 64 threads doing exactly the same thing and stepping on each other as they write to the same data, so by "the original posted version" I mean that code run with one block of one thread.)
Note that there are certainly other ways to parallelize, and probably other optimizations possible, but this should give you a sensible starting point for CUDA work.
As mentioned to you in your previous question, the second kernel here really doesn't do anything useful, but I assume that is just a placeholder for future work. I'm also not sure you'll get what you want with your usage of radii here, but that's also not the focus of this question.

So what is the effect of all this, performance-wise? Again we will compare the original posted version (t12.py) running just one block of one thread (not 16 blocks of 64 threads, which would only be worse anyway) against this version, which happens to be running 18 blocks of 64 threads (t11.py).

The profiler reports that for the original t12.py version there is a single update kernel running, with 1 block and 1 thread, and it is taking 1.8453 milliseconds. For the modified t11.py version posted in this answer, the profiler reports 18 blocks of 64 threads each for both the update1 and update2 kernels, and the combined execution time of these two kernels is approximately 5.47 + 1.12 = 6.59 microseconds.
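The profiler output itself isn't reproduced in this excerpt. As a rough cross-check without nvprof (this timing snippet is my addition, not part of the original answer), you can bracket the launch loop with cuda.synchronize() and a host timer; keep in mind that the first launch also pays the JIT-compilation cost, so warm up or discard the first measurement.

import time

cuda.synchronize()                 # drain any pending GPU work before timing
start = time.perf_counter()
for c in range(cycles):
    update1[blockspergrid, threadsperblock](device_p_x, device_p_y, device_velocities,
                                            device_max_velocities, acceleration, num_creatures)
    update2[blockspergrid, threadsperblock](device_p_x, device_p_y, device_radii,
                                            device_types, num_creatures)
cuda.synchronize()                 # kernel launches are asynchronous; wait before reading the clock
print('kernel time: %.6f s' % (time.perf_counter() - start))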
EDIT: Based on some discussion in the comments, it should be possible to combine both kernels into a single kernel, using a double-buffering scheme on p_x and p_y:
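The combined-kernel listing isn't reproduced in this excerpt either. Below is a minimal sketch of one way to arrange the double buffering (the device_p_x_new / device_p_y_new arrays and the even/odd ping-pong are my assumptions, reusing the launch configuration from the sketch above): each launch reads positions written by the previous launch and writes the moved positions into the other pair of buffers, so the intersection test in a given launch sees the previous cycle's positions.

@cuda.jit('void(float32[:], float32[:], float32[:], float32[:], float32[:], uint8[:], float32[:], float32[:], float32, uint32)')
def update(p_x, p_y, p_x_new, p_y_new, radii, types, velocities, max_velocities, acceleration, num_creatures):
    i = cuda.grid(1)
    if i < num_creatures:
        velocities[i] = velocities[i] + acceleration
        if velocities[i] > max_velocities[i]:
            velocities[i] = max_velocities[i]
        # write moved positions into the "new" buffers only
        p_x_new[i] = p_x[i] + (math.cos(1.0) * velocities[i])
        p_y_new[i] = p_y[i] + (math.sin(1.0) * velocities[i])
        # the intersection test reads the "old" buffers, which are complete
        # because they were written by the previous kernel launch
        for j in range(i, num_creatures):
            delta_x = p_x[j] - p_x[i]
            delta_y = p_y[j] - p_y[i]
            distance_squared = (delta_x * delta_x) + (delta_y * delta_y)
            sum_of_radii = radii[types[i]] + radii[types[i]]   # kept as in the question
            if distance_squared < sum_of_radii * sum_of_radii:
                pass

device_p_x_new = cuda.to_device(p_x)
device_p_y_new = cuda.to_device(p_y)
for c in range(cycles):
    if c % 2 == 0:   # even cycles: read p_x/p_y, write p_x_new/p_y_new
        update[blockspergrid, threadsperblock](device_p_x, device_p_y, device_p_x_new, device_p_y_new,
                                               device_radii, device_types, device_velocities,
                                               device_max_velocities, acceleration, num_creatures)
    else:            # odd cycles: the roles of the two buffer pairs swap
        update[blockspergrid, threadsperblock](device_p_x_new, device_p_y_new, device_p_x, device_p_y,
                                               device_radii, device_types, device_velocities,
                                               device_max_velocities, acceleration, num_creatures)
# after the loop, the most recent positions are in device_p_x_new/device_p_y_new when
# cycles is odd, and in device_p_x/device_p_y when cycles is even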
It is still necessary to preserve the kernel-calling loop in cycles in host code, since we still require the global sync provided by the kernel launch. But for a given number of cycles, this will reduce the contribution of the kernel-call overhead.

Using this technique, care must be taken in the choice of cycles, as well as in the use of data from either the p_x or p_x_new buffer, for coherent results.