Improving performance when memory access is inherently random

Published 2019-04-12 05:05

Question:

The basic concept of what I am doing

Complete coalition structure formation problem/Combinatorial auctions.

Given a set of N agents, which partition of the agents into disjoint subsets yields the best outcome?

E.g. Agents = {a,b} and their values

  • {a} = 2
  • {b} = 3
  • {a,b} = 4

In this instance the coalition structure {{a},{b}}, with total value 5, gives the best outcome; it is a partition of {a,b} into pairwise disjoint subsets.

So in short, the problem is about splitting a set and checking whether any splitting's sum is greater than the value of the original set. (This part works as well as it can in terms of generating splittings and how I represent my data.)

How I represent the data is basically an array of values, where the index is the set configuration (a set is represented as an integer bitmask).

For the example above:

int val[4] = {0,2,3,4} where set

  • {a} = 01 = index 1
  • {b} = 10 = index 2
  • {a,b} = 11 = index 3

So the set acts as an index in the value array.

The problem is that this gives random memory access: for n agents there are 2^n values to consider.

SKIP HERE FOR THE MEMORY ACCESS PROBLEM

E.g. in an environment of 20 agents, given the two-element set 10000000000000000001, the value for element 10000000000000000000 (index 2^19 = 524288) is 524287 indexes away from 00000000000000000001 (index 1), which is 2097148 bytes away in memory, i.e. more than 16000 cache lines of 128 bytes apart. Hence, a horrible access pattern.

The solutions I have looked into involve ordering the indexes by cardinality/Hamming weight:

  • Hamming weight based indexing
  • Determine the lexicographic distance between two integers

Both suffer from arithmetic overhead that, by my calculations, will overshadow the penalty of random access. Hamming weight based indexing is especially costly, as it uses many dependent calculations that get serialized and stall the thread.

As a last resort I ask here again: is there any method (I know about coalescing, which is hard to do for this problem) to improve access times when your solution depends on random access, or is it a hopeless situation?

Currently I see around 45% instruction replay overhead because of this, and changing the block size gives no notable result. And yes, I do try to hide the latency by calculating several coalitions per thread, which works somewhat.