I want to solve the following problem. I have to sample among an extremely large set, of the order of 10^20 and extracting a sample without repetitions of size about 10%-20%. Given the size of the set, I believe that an algorithm like Fisher–Yates is not feasible.
I'm thinking that something like random path tree might work for doing it in O(n log n) and can't be done faster, but I want to ask if something like this has already been implemented.
Thank you for your time!
I don't know how well the technique I describe below would do on formal tests of randomness, but it does give "random-looking" results.
You can do this with a multiplicative inverse. The idea is that you use a mathematical function to map every integer in the range 1-N to a unique integer in the same range. This is often used to generate obfuscated keys, but you can adapt it to generate random subsets by altering the seed value and the range from which you pull items.
A while back I wrote a blog entry about how to generate obfuscated sequential keys. Here's the code:
private void DoIt()
{
const long m = 101; // Number of keys + 1
const long x = 387420489; // must be coprime to m
// Compute the multiplicative inverse
var multInv = MultiplicativeInverse(x, m);
// HashSet is used to hold the obfuscated value so we can ensure that no duplicates occur.
var nums = new HashSet<long>();
// Obfuscate each number from 1 to 100.
// Show that the process can be reversed.
// Show that no duplicates are generated.
for (long i = 1; i <= 100; ++i)
{
var obfuscated = i * x % m;
var original = obfuscated * multInv % m;
Console.WriteLine("{0} => {1} => {2}", i, obfuscated, original);
if (!nums.Add(obfuscated))
{
Console.WriteLine("Duplicate");
}
}
}
private long MultiplicativeInverse(long x, long modulus)
{
return ExtendedEuclideanDivision(x, modulus).Item1 % modulus;
}
private static Tuple<long, long> ExtendedEuclideanDivision(long a, long b)
{
if (a < 0)
{
var result = ExtendedEuclideanDivision(-a, b);
return Tuple.Create(-result.Item1, result.Item2);
}
if (b < 0)
{
var result = ExtendedEuclideanDivision(a, -b);
return Tuple.Create(result.Item1, -result.Item2);
}
if (b == 0)
{
return Tuple.Create(1L, 0L);
}
var q = a / b;
var r = a % b;
var rslt = ExtendedEuclideanDivision(b, r);
var s = rslt.Item1;
var t = rslt.Item2;
return Tuple.Create(t, s - q * t);
}
The first few lines of output for that program are:
1 => 43 => 1
2 => 86 => 2
3 => 28 => 3
4 => 71 => 4
5 => 13 => 5
6 => 56 => 6
7 => 99 => 7
8 => 41 => 8
9 => 84 => 9
10 => 26 => 10
If you were to change the m
and x
values at the beginning of the function to reflect your range of numbers, this would work for you. And rather than always starting at 1 and grabbing the first 10 or 20%, you could start at the 50% mark and go from there. Or use some technique that grabs every fifth number, or whatever, so long as your method doesn't visit the same number twice.
And if you need more runs, just change the x
value.
Generating the multiplicative inverse (think of it as seeding the random number generator) is an O(log n) operation. After that, generating each number is O(1).
Of course, if you're working with numbers in the range of 10^20, you'll have to modify the code to work with a big integer class.