I'm trying to figure out if it's possible to efficiently calculate the conditional entropy of a set of numbers using CUDA. You can calculate the conditional entropy by dividing an array into windows, then counting the matching subarrays/substrings of each length across those windows. For each subarray length, you calculate an entropy by adding together the matching subarray counts times the log of those counts. The minimum of these entropies over all lengths is the conditional entropy.
To give a clearer example of what I mean, here is the full calculation:
The initial array is [1,2,3,5,1,2,5]. Assuming the window size is 3, this must be divided into five windows: [1,2,3], [2,3,5], [3,5,1], [5,1,2], and [1,2,5].
Next, looking at each window, we want to find the matching subarrays for each length.
The subarrays of length 1 are [1],[2],[3],[5],[1]. There are two 1s, and one of each other number. So the entropy is log(2)*2 + 3*(log(1)*1) = 0.6 (using log base 10).
The subarrays of length 2 are [1,2], [2,3], [3,5], [5,1], and [1,2]. There are two [1,2]s, and three subarrays that occur once. The entropy is the same as for length 1: log(2)*2 + 3*(log(1)*1) = 0.6.
The subarrays of length 3 are the full windows: [1,2,3], [2,3,5], [3,5,1], [5,1,2], and [1,2,5]. All five windows are unique, so the entropy is 5*(log(1)*1) = 0.
The minimum entropy is 0, so that is the conditional entropy for this array.
This can also be presented as a tree, where the counts at each node represent how many matches exist. The entropy for each subarray length is equivalent to the entropy for each level of the tree.
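To make the procedure concrete, here is a plain sequential version of the calculation (a sketch in C++; log base 10 is assumed to match the numbers above, and all names are mine):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <map>
#include <vector>

// For each subarray length: count identical window prefixes, sum
// count*log10(count), and take the minimum over all lengths.
double conditional_entropy(const std::vector<int>& a, int window) {
    const int num_windows = (int)a.size() - window + 1;
    double min_entropy = 1e300;
    for (int len = 1; len <= window; ++len) {
        std::map<std::vector<int>, int> counts;
        for (int w = 0; w < num_windows; ++w)
            ++counts[std::vector<int>(a.begin() + w, a.begin() + w + len)];
        double entropy = 0.0;
        for (const auto& kv : counts)
            entropy += kv.second * std::log10((double)kv.second);
        min_entropy = std::min(min_entropy, entropy);
    }
    return min_entropy;
}

int main() {
    const std::vector<int> a = {1, 2, 3, 5, 1, 2, 5};
    std::printf("conditional entropy = %f\n", conditional_entropy(a, 3)); // 0.0
    return 0;
}
```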
If possible, I'd like to perform this calculation on many arrays at once, and also perform the calculation itself in parallel. Does anyone have suggestions on how to accomplish this? Could thrust be useful? Please let me know if there is any additional information I should provide.
I tried solving your problem using thrust. It works, but it results in a lot of thrust calls. Since your input size is rather small, you should process multiple arrays in parallel. However, doing this results in a lot of book-keeping effort, as you will see in the following code.
Your input range is limited to `[1,5]`, which is equivalent to `[0,4]`. The general idea is that (theoretically) any tuple out of this range (e.g. `{1,2,3}`) can be represented as a number in base 4 (e.g. `1 + 2*4 + 3*16 = 57`). In practice we are limited by the size of the integer type: for a 32-bit unsigned integer this leads to a maximum tuple size of `16`. This is also the maximum window size the following code can handle (changing to a 64-bit unsigned integer would lead to a maximum tuple size of `32`).

Let's assume the input data is structured like this: we have `2` arrays we want to process in parallel, each array is of size `5`, and the window size is `3`.

We can now generate all windows.
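A sketch of this step (the example data, functor, and all names here are mine, not the original code): a single `thrust::gather` with a computed index map expands both arrays into their sliding windows.

```cpp
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/gather.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Maps a flat output index with layout [array][window][offset] to the
// position of that element in the concatenated input arrays.
struct window_index {
    int n, w, nw;  // array size, window size, windows per array
    window_index(int n_, int w_) : n(n_), w(w_), nw(n_ - w_ + 1) {}
    __host__ __device__ int operator()(int i) const {
        const int array  = i / (nw * w);
        const int window = (i % (nw * w)) / w;
        const int offset = i % w;
        return array * n + window + offset;
    }
};

int main() {
    const int A = 2, N = 5, W = 3, NW = N - W + 1;
    const int in_h[A * N] = {1, 2, 3, 5, 1,   1, 2, 5, 1, 2};  // example input
    thrust::device_vector<int> in(in_h, in_h + A * N);
    thrust::device_vector<int> windows(A * NW * W);

    // One gather expands both arrays into all of their sliding windows.
    thrust::gather(
        thrust::make_transform_iterator(thrust::counting_iterator<int>(0),
                                        window_index(N, W)),
        thrust::make_transform_iterator(thrust::counting_iterator<int>(A * NW * W),
                                        window_index(N, W)),
        in.begin(), windows.begin());

    for (int i = 0; i < A * NW * W; ++i)  // prints 1 2 3 / 2 3 5 / 3 5 1 / ...
        std::printf("%d%c", (int)windows[i], (i % W == W - 1) ? '\n' : ' ');
    return 0;
}
```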
Using a per-tuple prefix sum and applying the aforementioned representation of each tuple as a single base-4 number, we get one number for every prefix of every window.
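This can be sketched as a weighting transform followed by a segmented `inclusive_scan_by_key` (again, the data and names are my own): the window `[1,2,3]` becomes `1, 9, 57`, i.e. the base-4 numbers of its prefixes.

```cpp
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Weights the element at offset j of its window with 4^j.
struct weight_by_offset {
    int w;
    weight_by_offset(int w_) : w(w_) {}
    __host__ __device__ unsigned int operator()(unsigned int value, int i) const {
        unsigned int p = 1;
        for (int j = 0; j < i % w; ++j) p *= 4;
        return value * p;
    }
};

// All elements of one window share one key, so the scan restarts there.
struct window_id {
    int w;
    window_id(int w_) : w(w_) {}
    __host__ __device__ int operator()(int i) const { return i / w; }
};

int main() {
    const int A = 2, NW = 3, W = 3, TOTAL = A * NW * W;
    // the windows from the previous step, flattened
    const unsigned int win_h[TOTAL] = {1,2,3, 2,3,5, 3,5,1,
                                       1,2,5, 2,5,1, 5,1,2};
    thrust::device_vector<unsigned int> win(win_h, win_h + TOTAL);
    thrust::device_vector<unsigned int> prefix(TOTAL);

    // weighted[i] = win[i] * 4^(i % W)
    thrust::transform(win.begin(), win.end(), thrust::counting_iterator<int>(0),
                      prefix.begin(), weight_by_offset(W));

    // per-window inclusive scan: [1,2,3] becomes 1, 9, 57
    thrust::inclusive_scan_by_key(
        thrust::make_transform_iterator(thrust::counting_iterator<int>(0), window_id(W)),
        thrust::make_transform_iterator(thrust::counting_iterator<int>(TOTAL), window_id(W)),
        prefix.begin(), prefix.begin());

    for (int i = 0; i < TOTAL; ++i)
        std::printf("%u%c", (unsigned int)prefix[i], (i % W == W - 1) ? '\n' : ' ');
    return 0;
}
```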
Now we reorder the values so that the numbers representing subarrays of the same length are stored next to each other.
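One way to sketch the reordering is a `thrust::scatter` whose destination index swaps the window and length coordinates (hypothetical names; the input is the prefix numbers computed above):

```cpp
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/scatter.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Destination of element i: regroup from [array][window][length] to
// [array][length][window], so the numbers for one subarray length of
// one array become contiguous.
struct regroup {
    int w, nw;
    regroup(int w_, int nw_) : w(w_), nw(nw_) {}
    __host__ __device__ int operator()(int i) const {
        const int array  = i / (nw * w);
        const int window = (i % (nw * w)) / w;
        const int length = i % w;  // offset == subarray length - 1
        return array * (w * nw) + length * nw + window;
    }
};

int main() {
    const int A = 2, NW = 3, W = 3, TOTAL = A * NW * W;
    // the prefix numbers from the previous step
    const unsigned int pre_h[TOTAL] = {1,9,57, 2,14,94, 3,23,39,
                                       1,9,89, 2,22,38, 5,9,41};
    thrust::device_vector<unsigned int> pre(pre_h, pre_h + TOTAL);
    thrust::device_vector<unsigned int> grouped(TOTAL);

    thrust::scatter(pre.begin(), pre.end(),
                    thrust::make_transform_iterator(thrust::counting_iterator<int>(0),
                                                    regroup(W, NW)),
                    grouped.begin());

    for (int i = 0; i < TOTAL; ++i)  // 1 2 3 / 9 14 23 / 57 94 39 / ...
        std::printf("%u%c", (unsigned int)grouped[i], (i % NW == NW - 1) ? '\n' : ' ');
    return 0;
}
```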
We then sort the values within each group.
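A common way to sort within segments when no segmented-sort primitive is at hand is two stable sorts: first by value, then by group id. A sketch with the (made-up) data from above:

```cpp
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>

int main() {
    const int A = 2, NW = 3, W = 3, TOTAL = A * NW * W;
    // regrouped numbers from the previous step; every NW consecutive
    // values form one (array, subarray length) group
    const unsigned int g_h[TOTAL] = {1,2,3, 9,14,23, 57,94,39,
                                     1,2,5, 9,22,9,  89,38,41};
    thrust::device_vector<unsigned int> val(g_h, g_h + TOTAL);

    // group id of element i is i / NW
    thrust::device_vector<int> key(TOTAL);
    thrust::transform(thrust::counting_iterator<int>(0),
                      thrust::counting_iterator<int>(TOTAL),
                      key.begin(), thrust::placeholders::_1 / NW);

    // segmented sort via two stable sorts: sort by value, then stable
    // sort by group id, which keeps equal groups in value order
    thrust::stable_sort_by_key(val.begin(), val.end(), key.begin());
    thrust::stable_sort_by_key(key.begin(), key.end(), val.begin());

    for (int i = 0; i < TOTAL; ++i)  // e.g. 9 22 9 is now 9 9 22
        std::printf("%u%c", (unsigned int)val[i], (i % NW == NW - 1) ? '\n' : ' ');
    return 0;
}
```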
Now we can count how often each number occurs within its group.
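Counting can be sketched with `reduce_by_key` over (group, value) pairs and a constant iterator of ones (names and data are mine):

```cpp
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

int main() {
    const int NW = 3, GROUPS = 6, TOTAL = GROUPS * NW;
    // sorted values and their group ids from the previous step
    const int grp_h[TOTAL] = {0,0,0, 1,1,1, 2,2,2, 3,3,3, 4,4,4, 5,5,5};
    const unsigned int val_h[TOTAL] = {1,2,3, 9,14,23, 39,57,94,
                                       1,2,5, 9,9,22,  38,41,89};
    thrust::device_vector<int> grp(grp_h, grp_h + TOTAL);
    thrust::device_vector<unsigned int> val(val_h, val_h + TOTAL);

    thrust::device_vector<int> out_grp(TOTAL);
    thrust::device_vector<unsigned int> out_val(TOTAL);
    thrust::device_vector<int> count(TOTAL);

    // every run of equal (group, value) pairs reduces to its length
    auto end = thrust::reduce_by_key(
        thrust::make_zip_iterator(thrust::make_tuple(grp.begin(), val.begin())),
        thrust::make_zip_iterator(thrust::make_tuple(grp.end(), val.end())),
        thrust::constant_iterator<int>(1),
        thrust::make_zip_iterator(thrust::make_tuple(out_grp.begin(), out_val.begin())),
        count.begin());

    const int runs = (int)(end.second - count.begin());
    for (int i = 0; i < runs; ++i)  // e.g. "group 4: value 9 occurs 2 time(s)"
        std::printf("group %d: value %u occurs %d time(s)\n",
                    (int)out_grp[i], (unsigned int)out_val[i], (int)count[i]);
    return 0;
}
```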
Applying the log formula to these counts gives the entropy for each subarray length of each array.
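A sketch of this step: transform every count into `count * log(count)` and sum the terms per group with another `reduce_by_key` (log base 10 here, to match the worked example in the question; data and names are mine):

```cpp
#include <cstdio>
#include <math.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/transform.h>

// entropy contribution of one run of identical subarrays
struct count_log_count {
    __host__ __device__ float operator()(int c) const {
        return (float)c * log10f((float)c);  // log base 10, as in the question
    }
};

int main() {
    const int RUNS = 17;
    // group id and occurrence count of every run from the previous step
    const int grp_h[RUNS] = {0,0,0, 1,1,1, 2,2,2, 3,3,3, 4,4, 5,5,5};
    const int cnt_h[RUNS] = {1,1,1, 1,1,1, 1,1,1, 1,1,1, 2,1, 1,1,1};
    thrust::device_vector<int> grp(grp_h, grp_h + RUNS);
    thrust::device_vector<int> cnt(cnt_h, cnt_h + RUNS);

    // term = count * log(count), then sum the terms per group
    thrust::device_vector<float> term(RUNS);
    thrust::transform(cnt.begin(), cnt.end(), term.begin(), count_log_count());

    thrust::device_vector<int> out_grp(RUNS);
    thrust::device_vector<float> entropy(RUNS);
    auto end = thrust::reduce_by_key(grp.begin(), grp.end(), term.begin(),
                                     out_grp.begin(), entropy.begin());

    const int groups = (int)(end.first - out_grp.begin());
    for (int i = 0; i < groups; ++i)  // group 4 yields 2*log10(2) = 0.602
        std::printf("group %d: entropy %f\n", (int)out_grp[i], (float)entropy[i]);
    return 0;
}
```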
Finally we fetch the minimum value per array.
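This can be sketched by keying each (array, length) entropy with its array id and reducing with `thrust::minimum` (data and names are mine):

```cpp
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// group i belongs to array i / W (W groups per array, one per length)
struct array_id {
    int w;
    array_id(int w_) : w(w_) {}
    __host__ __device__ int operator()(int i) const { return i / w; }
};

int main() {
    const int A = 2, W = 3;
    // entropy per (array, length) group from the previous step
    const float ent_h[A * W] = {0.f, 0.f, 0.f,   0.f, 0.602f, 0.f};
    thrust::device_vector<float> ent(ent_h, ent_h + A * W);

    thrust::device_vector<int> arr(A);
    thrust::device_vector<float> result(A);

    // reduce the W entropies of each array with the minimum functor
    thrust::reduce_by_key(
        thrust::make_transform_iterator(thrust::counting_iterator<int>(0),
                                        array_id(W)),
        thrust::make_transform_iterator(thrust::counting_iterator<int>(A * W),
                                        array_id(W)),
        ent.begin(), arr.begin(), result.begin(),
        thrust::equal_to<int>(), thrust::minimum<float>());

    for (int i = 0; i < A; ++i)
        std::printf("array %d: conditional entropy %f\n", i, (float)result[i]);
    return 0;
}
```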
Compile using a command along these lines (the file name here is just a placeholder):
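```
nvcc -O3 -std=c++11 conditional_entropy.cu -o conditional_entropy
```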
This configuration takes 133 milliseconds on my GPU (GTX 680), so around 0.1 milliseconds per array.
The implementation can definitely be optimized further, e.g. by using a precomputed lookup table for the base-4 conversion; some of the thrust calls could probably be avoided as well.