Find an integer not among four billion given ones

2019-01-02 16:28发布

It is an interview question:

Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB memory. Follow up with what you would do if you have only 10 MB of memory.

My analysis:

The size of the file is 4×109×4 bytes = 16 GB.

We can do external sorting, thus we get to know the range of the integers. My question is what is the best way to detect the missing integer in the sorted big integer sets?

My understanding(after reading all answers):

Assuming we are talking about 32-bit integers. There are 2^32 = 4*109 distinct integers.

Case 1: we have 1 GB = 1 * 109 * 8 bits = 8 billion bits memory. Solution: if we use one bit representing one distinct integer, it is enough. we don't need sort. Implementation:

int radix = 8;
byte[] bitfield = new byte[0xffffffff/radix];
void F() throws FileNotFoundException{
    Scanner in = new Scanner(new FileReader("a.txt"));
    while(in.hasNextInt()){
        int n = in.nextInt();
        bitfield[n/radix] |= (1 << (n%radix));
    }

    for(int i = 0; i< bitfield.lenght; i++){
        for(int j =0; j<radix; j++){
            if( (bitfield[i] & (1<<j)) == 0) System.out.print(i*radix+j);
        }
    }
}

Case 2: 10 MB memory = 10 * 106 * 8 bits = 80 million bits

Solution: For all possible 16-bit prefixes, there are 2^16 number of
integers = 65536, we need 2^16 * 4 * 8 = 2 million bits. We need build
65536 buckets. For each bucket, we need 4 bytes holding all possibilities because
 the worst case is all the 4 billion integers belong to the same bucket.

step1: Build the counter of each bucket through the first pass through the file.
step2: Scan the buckets, find the first one who has less than 65536 hit.
step3: Build new buckets whose high 16-bit prefixes are we found in step2
through second pass of the file
step4: Scan the buckets built in step3, find the first bucket which doesnt
have a hit.

The code is very similar to above one.

Conclusion: We decrease memory through increasing file pass.


A clarification for those arriving late: The question, as asked, does not say that there is exactly one integer that is not contained in the file -- at least that's not how most people interpret it. Many comments in the comment thread are about that variation of the task, though. Unfortunately the comment that introduced it to the comment thread was later deleted by its author, so now it looks like the orphaned replies to it just misunderstood everything. It's very confusing. Sorry.

标签: algorithm
30条回答
路过你的时光
2楼-- · 2019-01-02 16:40

If we assume that the range of numbers will always be 2^n (an even power of 2), then exclusive-or will work (as shown by another poster). As far as why, let's prove it:

The Theory

Given any 0 based range of integers that has 2^n elements with one element missing, you can find that missing element by simply xor-ing the known values together to yield the missing number.

The Proof

Let's look at n = 2. For n=2, we can represent 4 unique integers: 0, 1, 2, 3. They have a bit pattern of:

  • 0 - 00
  • 1 - 01
  • 2 - 10
  • 3 - 11

Now, if we look, each and every bit is set exactly twice. Therefore, since it is set an even number of times, and exclusive-or of the numbers will yield 0. If a single number is missing, the exclusive-or will yield a number that when exclusive-ored with the missing number will result in 0. Therefore, the missing number, and the resulting exclusive-ored number are exactly the same. If we remove 2, the resulting xor will be 10 (or 2).

Now, let's look at n+1. Let's call the number of times each bit is set in n, x and the number of times each bit is set in n+1 y. The value of y will be equal to y = x * 2 because there are x elements with the n+1 bit set to 0, and x elements with the n+1 bit set to 1. And since 2x will always be even, n+1 will always have each bit set an even number of times.

Therefore, since n=2 works, and n+1 works, the xor method will work for all values of n>=2.

The Algorithm For 0 Based Ranges

This is quite simple. It uses 2*n bits of memory, so for any range <= 32, 2 32 bit integers will work (ignoring any memory consumed by the file descriptor). And it makes a single pass of the file.

long supplied = 0;
long result = 0;
while (supplied = read_int_from_file()) {
    result = result ^ supplied;
}
return result;

The Algorithm For Arbitrary Based Ranges

This algorithm will work for ranges of any starting number to any ending number, as long as the total range is equal to 2^n... This basically re-bases the range to have the minimum at 0. But it does require 2 passes through the file (the first to grab the minimum, the second to compute the missing int).

long supplied = 0;
long result = 0;
long offset = INT_MAX;
while (supplied = read_int_from_file()) {
    if (supplied < offset) {
        offset = supplied;
    }
}
reset_file_pointer();
while (supplied = read_int_from_file()) {
    result = result ^ (supplied - offset);
}
return result + offset;

Arbitrary Ranges

We can apply this modified method to a set of arbitrary ranges, since all ranges will cross a power of 2^n at least once. This works only if there is a single missing bit. It takes 2 passes of an unsorted file, but it will find the single missing number every time:

long supplied = 0;
long result = 0;
long offset = INT_MAX;
long n = 0;
double temp;
while (supplied = read_int_from_file()) {
    if (supplied < offset) {
        offset = supplied;
    }
}
reset_file_pointer();
while (supplied = read_int_from_file()) {
    n++;
    result = result ^ (supplied - offset);
}
// We need to increment n one value so that we take care of the missing 
// int value
n++
while (n == 1 || 0 != (n & (n - 1))) {
    result = result ^ (n++);
}
return result + offset;

Basically, re-bases the range around 0. Then, it counts the number of unsorted values to append as it computes the exclusive-or. Then, it adds 1 to the count of unsorted values to take care of the missing value (count the missing one). Then, keep xoring the n value, incremented by 1 each time until n is a power of 2. The result is then re-based back to the original base. Done.

Here's the algorithm I tested in PHP (using an array instead of a file, but same concept):

function find($array) {
    $offset = min($array);
    $n = 0;
    $result = 0;
    foreach ($array as $value) {
        $result = $result ^ ($value - $offset);
        $n++;
    }
    $n++; // This takes care of the missing value
    while ($n == 1 || 0 != ($n & ($n - 1))) {
        $result = $result ^ ($n++);
    }
    return $result + $offset;
}

Fed in an array with any range of values (I tested including negatives) with one inside that range which is missing, it found the correct value each time.

Another Approach

Since we can use external sorting, why not just check for a gap? If we assume the file is sorted prior to the running of this algorithm:

long supplied = 0;
long last = read_int_from_file();
while (supplied = read_int_from_file()) {
    if (supplied != last + 1) {
        return last + 1;
    }
    last = supplied;
}
// The range is contiguous, so what do we do here?  Let's return last + 1:
return last + 1;
查看更多
萌妹纸的霸气范
3楼-- · 2019-01-02 16:41

Trick question, unless it's been quoted improperly. Just read through the file once to get the maximum integer n, and return n+1.

Of course you'd need a backup plan in case n+1 causes an integer overflow.

查看更多
爱死公子算了
4楼-- · 2019-01-02 16:42

For the 1 GB RAM variant you can use a bit vector. You need to allocate 4 billion bits == 500 MB byte array. For each number you read from the input, set the corresponding bit to '1'. Once you done, iterate over the bits, find the first one that is still '0'. Its index is the answer.

查看更多
骚的不知所云
5楼-- · 2019-01-02 16:43

Bit Elimination

One way is to eliminate bits, however this might not actually yield a result (chances are it won't). Psuedocode:

long val = 0xFFFFFFFFFFFFFFFF; // (all bits set)
foreach long fileVal in file
{
    val = val & ~fileVal;
    if (val == 0) error;
}

Bit Counts

Keep track of the bit counts; and use the bits with the least amounts to generate a value. Again this has no guarantee of generating a correct value.

Range Logic

Keep track of a list ordered ranges (ordered by start). A range is defined by the structure:

struct Range
{
  long Start, End; // Inclusive.
}
Range startRange = new Range { Start = 0x0, End = 0xFFFFFFFFFFFFFFFF };

Go through each value in the file and try and remove it from the current range. This method has no memory guarantees, but it should do pretty well.

查看更多
永恒的永恒
6楼-- · 2019-01-02 16:45

If you have one integer missing from the range [0, 2^x - 1] then just xor them all together. For example:

>>> 0 ^ 1 ^ 3
2
>>> 0 ^ 1 ^ 2 ^ 3 ^ 4 ^ 6 ^ 7
5

(I know this doesn't answer the question exactly, but it's a good answer to a very similar question.)

查看更多
君临天下
7楼-- · 2019-01-02 16:45

As long as we're doing creative answers, here is another one.

Use the external sort program to sort the input file numerically. This will work for any amount of memory you may have (it will use file storage if needed). Read through the sorted file and output the first number that is missing.

查看更多
登录 后发表回答