Sorting 1 million 8-digit numbers in 1 MB of RAM

2020-02-16 05:24发布

I have a computer with 1 MB of RAM and no other local storage. I must use it to accept 1 million 8-digit decimal numbers over a TCP connection, sort them, and then send the sorted list out over another TCP connection.

The list of numbers may contain duplicates, which I must not discard. The code will be placed in ROM, so I need not subtract the size of my code from the 1 MB. I already have code to drive the Ethernet port and handle TCP/IP connections, and it requires 2 KB for its state data, including a 1 KB buffer via which the code will read and write data. Is there a solution to this problem?

Sources Of Question And Answer:

slashdot.org

cleaton.net

30条回答
贪生不怕死
2楼-- · 2020-02-16 05:43

I have a computer with 1M of RAM and no other local storage

Another way to cheat: you could use non-local (networked) storage instead (your question does not preclude this) and call a networked service that could use straightforward disk-based mergesort (or just enough RAM to sort in-memory, since you only need to accept 1M numbers), without needing the (admittedly extremely ingenious) solutions already given.

This might be cheating, but it's not clear whether you are looking for a solution to a real-world problem, or a puzzle that invites bending of the rules... if the latter, then a simple cheat may get better results than a complex but "genuine" solution (which as others have pointed out, can only work for compressible inputs).

查看更多
混吃等死
3楼-- · 2020-02-16 05:44

As ROM size does not count one does not need any additional RAM besides the TCP buffers. Just implement a big finite-state machine. Each state represents a multi-set of numbers read in. After reading a million numbers one has just to print the numbers corresponding to the reached state.

查看更多
来,给爷笑一个
4楼-- · 2020-02-16 05:46

There is one solution to this problem across all possible inputs. Cheat.

  1. Read m values over TCP, where m is near the max that can be sorted in memory, maybe n/4.
  2. Sort the 250,000 (or so) numbers and output them.
  3. Repeat for the other 3 quarters.
  4. Let the receiver merge the 4 lists of numbers it has received as it processes them. (It's not much slower than using a single list.)
查看更多
Ridiculous、
5楼-- · 2020-02-16 05:46

You just need to store the differences between the numbers in sequence, and use an encoding to compress these sequence numbers. We have 2^23 bits. We shall divide it into 6bit chunks, and let the last bit indicate whether the number extends to another 6 bits (5bits plus extending chunk).

Thus, 000010 is 1, and 000100 is 2. 000001100000 is 128. Now, we consider the worst cast in representing differences in sequence of a numbers up to 10,000,000. There can be 10,000,000/2^5 differences greater than 2^5, 10,000,000/2^10 differences greater than 2^10, and 10,000,000/2^15 differences greater than 2^15, etc.

So, we add how many bits it will take to represent our the sequence. We have 1,000,000*6 + roundup(10,000,000/2^5)*6+roundup(10,000,000/2^10)*6+roundup(10,000,000/2^15)*6+roundup(10,000,000/2^20)*4=7935479.

2^24 = 8388608. Since 8388608 > 7935479, we should easily have enough memory. We will probably need another little bit of memory to store the sum of where are when we insert new numbers. We then go through the sequence, and find where to insert our new number, decrease the next difference if necessary, and shift everything after it right.

查看更多
别忘想泡老子
6楼-- · 2020-02-16 05:46

If we don't know anything about those numbers, we are limited by the following constraints:

  • we need to load all numbers before we can sort them them,
  • the set of numbers is not compressible.

If these assumptions hold, there is no way to carry out your task, as you will need at least 26,575,425 bits of storage (3,321,929 bytes).

What can you tell us about your data ?

查看更多
我命由我不由天
7楼-- · 2020-02-16 05:47

Google's (bad) approach, from HN thread. Store RLE-style counts.

Your initial data structure is '99999999:0' (all zeros, haven't seen any numbers) and then lets say you see the number 3,866,344 so your data structure becomes '3866343:0,1:1,96133654:0' as you can see the numbers will always alternate between number of zero bits and number of '1' bits so you can just assume the odd numbers represent 0 bits and the even numbers 1 bits. This becomes (3866343,1,96133654)

Their problem doesn't seem to cover duplicates, but let's say they use "0:1" for duplicates.

Big problem #1: insertions for 1M integers would take ages.

Big problem #2: like all plain delta encoding solutions, some distributions can't be covered this way. For example, 1m integers with distances 0:99 (e.g. +99 each one). Now think the same but with random distance in the range of 0:99. (Note: 99999999/1000000 = 99.99)

Google's approach is both unworthy (slow) and incorrect. But to their defense, their problem might have been slightly different.

查看更多
登录 后发表回答