-->

Efficient all-pairs set intersection on GPU

2020-07-22 19:30发布

问题:

I have n sets, subsets of a finite universe. I want to calculate the n*n matrix in which the (I, J) entry contains the cardinality of the intersection of set I and set J. n is in the order of 50000.

My idea is to split the matrix into blocks sufficiently small so to have one thread per entry. Every thread should calculate the intersection using bitwise and.

Are there more efficient approaches to solve this problem?

回答1:

I'm going to assume you want to compute it as you described: actually computing the intersection of every pair of sets, using bitwise and of bitsets.

With the right mathematical setup, you are literally computing the outer product of two vectors, so I will think in terms of high performance linear algebra.

The key to performance is going to be reducing memory traffic, and that means holding things in registers when you can. The overwhelmingly most significant factor is that your elements are huge; it takes 6250 32-bit words to store a single set! An entire multiprocessor of cuda compute capability 3.0, for example, can only hold a mere 10 sets in registers.

What you probably want to do is to spread each element out across an entire thread block. With 896 threads in a block and 7 registers per block, you can store one set of 200704 elements. With cuda compute capability 3.0, you will have 36 registers available per block.

The simplest implementation would be to have each block own one row of the output matrix. It loads the corresponding element of the second vector and stores it in registers, and then iterates over all of the elements of the first vector, computing the intersection, computing and reducing the popcount, and then storing the result in the output vector.

This optimization should reduce the overall number of memory reads by a factor of 2, and thus is likely to double performance.

Better would be to have each block own 3-4 rows of the output matrix at once, and loads the corresponding 3-4 elements of the second vector into registers. Then the block iterates over all of the elements of the first register, and for each it computes the 3-4 intersections it can, storing the result in the output matrix.

This optimization reduces the memory traffic by an additional factor of 3-4.



回答2:

A completely different approach would be to work with each element of the universe individually: for each element of the universe, you compute which sets actually contain that element, and then (atomically) increment the corresponding entries of the output matrix.

Asymptotically, this should be much more efficient than computing the intersections of sets. Unfortunately, it sounds hard to implement efficiently.


An improvement is to work with, say, 4 elements of the universe at a time. You split up all of your sets up into 16 buckets, depending on which of those 4 elements the set contains. Then, for each of the 16*16 possible pairs of buckets, you iterate through all pairs of vectors from the buckets and (atomically) update the corresponding entry of the matrix appropriately.

This should be even faster than the version described above, but it still may potentially be difficult to implement.


To reduce the difficulty of getting all of the synchronization worked out, you could partition all of input sets into k groups of n/k sets each. Then, the (i,j)-th thread (or warp or block) only does the updates for the corresponding block of the output matrix.



回答3:

A different approach to breaking up the problem is to to split the universe into smaller partitions of 1024 elements each, and compute just the size of the intersections in this part of the universe.

I'm not sure if I've described that well; basically you're computing

A[i,j] = sum((k in v[i]) * (k in w[j]) for k in the_universe)

where v and w are the two lists of sets, and k in S is 1 if true and 0 otherwise. The point is to permute the indices so that k is in the outer loop rather than the inner loop, although for efficiency you will have to work with many consecutive k at once, rather than one at a time.

That is, you initialize the output matrix to all zeroes, and for each block of 1024 universe elements, you compute the sizes of the intersections and accumulate the results into the output matrix.

I choose 1024, because I imagine you'll have a data layout where that's probably the smallest size where you can still get the full memory bandwidth when reading from device memory, and all of the threads in warp work together. (adjust this appropriately if you know better than me, or you aren't using nVidia and whatever other GPUs you're using would work with something better)

Now that your elements are a reasonable size, you can now appeal to traditional linear algebra optimizations to compute this product. I would probably do the following:

Each warp is assigned a large number of rows of the output matrix. It reads the corresponding elements out of the second vector, and then iterates through the first vector, computing products.

You could have all of the warps operate independently, but it may be better to do the following:

  • All of the warps in a block work together to load some number of elements from the first vector
  • Each warp computes the intersections it can and writes the results to the output matrix

You could store the loaded elements in shared memory, but you might get better results holding them in registers. Each warp can only compute the intersections with the set elements its holding onto, and you but after doing so the warps can all rotate which warps are holding which elements.

If you do enough optimizations along these lines, you will probably reach the point where you are no longer memory bound, which means you might not have to go so far as to do the most complicated optimizations (e.g. the shared memory approach described above might already be enough).