I have a computer with 1 MB of RAM and no other local storage. I must use it to accept 1 million 8-digit decimal numbers over a TCP connection, sort them, and then send the sorted list out over another TCP connection.
The list of numbers may contain duplicates, which I must not discard. The code will be placed in ROM, so I need not subtract the size of my code from the 1 MB. I already have code to drive the Ethernet port and handle TCP/IP connections, and it requires 2 KB for its state data, including a 1 KB buffer via which the code will read and write data. Is there a solution to this problem?
Suppose this task is possible. Just prior to output, there will be an in-memory representation of the million sorted numbers. How many different such representations are there? Since there may be repeated numbers we can't use nCr (choose), but there is an operation called multichoose that works on multisets.
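To make that concrete (my own back-of-the-envelope calculation, not spelled out in the original): the number of multisets of size 10^6 drawn from 10^8 possible values is (10^8 multichoose 10^6) = C(10^8 + 10^6 - 1, 10^6), and its base-2 logarithm is the minimum number of bits any lossless representation of the sorted output must be able to occupy. A few lines of C++ evaluate it via lgamma:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double n = 1e8;  // distinct 8-digit decimal values
    const double k = 1e6;  // numbers to be stored
    // C(n + k - 1, k) = Gamma(n + k) / (Gamma(k + 1) * Gamma(n)),
    // evaluated in log space to avoid overflow.
    const double ln_count = std::lgamma(n + k) - std::lgamma(k + 1) - std::lgamma(n);
    const double bits = ln_count / std::log(2.0);
    std::printf("~%.0f bits = ~%.0f bytes\n", bits, bits / 8.0);
    return 0;
}
```

This prints roughly 8.09 million bits, i.e. a little over 1,011,000 bytes (remarkably close to the 1011732-byte empirical maximum quoted in a later answer): less than 1 MiB (1,048,576 bytes), but more than one million bytes.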
So theoretically it may be possible, if you can come up with a sane (enough) representation of the sorted list of numbers. For example, an insane representation might require a 10 MB lookup table or thousands of lines of code.
However, if "1M RAM" means one million bytes, then clearly there is not enough space. The fact that roughly 5% more memory (1,048,576 bytes versus 1,000,000) makes it theoretically possible suggests to me that the representation will have to be VERY efficient and probably not sane.
I think the solution is to borrow a technique from video encoding: delta (differential) encoding. In digital video, rather than recording the brightness or colour as absolute values such as 110 112 115 116, each value is expressed as its difference from the previous one (similar to run-length encoding): 110 112 115 116 becomes 110 2 3 1. The deltas 2 3 1 require fewer bits than the originals.
So let's say we create a list of the input values as they arrive on the socket, storing in each element not the value itself but its offset from the element before it. We sort as we go, so the offsets are always non-negative. An offset can be up to 8 decimal digits wide, which needs 27 bits, just over 3 bytes. We can't afford several bytes per element, so the offsets must be packed: we can use the top bit of each byte as a "continue" bit, indicating that the next byte is part of the same number, and combine the lower 7 bits of each byte to form the value. A zero offset is valid and marks a duplicate.
As the list fills up, the numbers get closer together, meaning that on average only 1 byte is needed to encode the distance to the next value. Seven bits of value and one continue bit per byte is convenient, but there may be a sweet spot that requires fewer than 8 bits per "continue" unit.
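A minimal sketch of that variable-length encoding (my illustration of the scheme described above, not the author's code): each byte carries 7 bits of the delta plus a top "continue" bit, and a delta of 0 is a single 0x00 byte marking a duplicate.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append one delta using 7 data bits per byte plus a "continue" bit.
void encode_delta(std::vector<uint8_t>& out, uint32_t delta) {
    while (delta >= 0x80) {
        out.push_back(static_cast<uint8_t>(delta & 0x7F) | 0x80);  // continue bit set
        delta >>= 7;
    }
    out.push_back(static_cast<uint8_t>(delta));  // final byte: continue bit clear
}

// Read one delta starting at pos, advancing pos past it.
uint32_t decode_delta(const std::vector<uint8_t>& in, std::size_t& pos) {
    uint32_t value = 0;
    int shift = 0;
    while (in[pos] & 0x80) {  // continue bit set: more bytes follow
        value |= static_cast<uint32_t>(in[pos++] & 0x7F) << shift;
        shift += 7;
    }
    value |= static_cast<uint32_t>(in[pos++]) << shift;
    return value;
}
```

With gaps averaging about 100, most deltas fit in a single byte, though any gap of 128 or more spills into a second byte.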
Anyway, I ran some experiments. Using a random number generator, I can fit a million sorted 8-digit decimal numbers into about 1279000 bytes. The average space between consecutive numbers is consistently 99...
Here's some working C++ code which solves the problem.
Proof that the memory constraints are satisfied: (Editor: there is no proof of the maximum memory requirement offered by the author, either in this post or in his blogs. Since the number of bits necessary to encode a value depends on the values previously encoded, such a proof is likely non-trivial. The author notes that the largest encoded size he could stumble upon empirically was 1011732 bytes, and chose the buffer size 1013000 arbitrarily.)

Together, these two arrays take 1045000 bytes of storage. That leaves 1048576 - 1045000 - 2×1024 = 1528 bytes for remaining variables and stack space.
It runs in about 23 seconds on my Xeon W3520. You can verify that the program works using the following Python script, assuming a program name of sort1mb.exe.

A detailed explanation of the algorithm can be found in the following series of posts:
To represent the sorted array one can just store the first element and the differences between adjacent elements; for example, the sorted list 3, 7, 7, 12 becomes 3, 4, 0, 5. In this way we are concerned with encoding 10^6 elements that can sum up to at most 10^8. Let's call this sequence D. To encode the elements of D one can use a Huffman code. The dictionary for the Huffman code can be built on the fly, with the array updated every time a new item is inserted into the sorted array (insertion sort). Note that when the dictionary changes because of a new item, the whole array must be re-encoded to match the new code.
The average number of bits needed to encode each element of D is maximized if we have an equal number of each unique element. Say the unique elements d1, d2, ..., dN in D each appear F times. Then (in the worst case, where both 0 and 10^8 appear in the input sequence) we have

sum(1<=i<=N) F·di = 10^8

where

sum(1<=i<=N) F = 10^6, i.e. F = 10^6/N, and the normalized frequency of each element is p = F/10^6 = 1/N.

The average number of bits is then -log2(p) = log2(N). Under these circumstances we should look for the case that maximizes N. That happens if the di are consecutive numbers starting from 0, i.e. di = i-1, so that

10^8 = sum(1<=i<=N) F·di = sum(1<=i<=N) (10^6/N)(i-1) = (10^6/N)·N(N-1)/2 = 10^6 (N-1)/2

i.e.

N <= 201. For this case the average number of bits is log2(201) = 7.6511, which means we would need around 1 byte per input element to store the sorted array. Note that this does not mean D in general cannot have more than 201 unique elements; it just shows that if the elements of D are uniformly distributed, there cannot be more than 201 unique values.
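As a quick empirical cross-check (mine, not part of this answer), one can measure the entropy of D directly for uniformly random input. The deltas come out roughly geometric with mean ~100, and the measured figure is about 8.1 bits per element, slightly above the 7.65-bit equal-frequency estimate derived above and in line with the multichoose bound from the first answer.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <map>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(12345);
    std::uniform_int_distribution<uint32_t> dist(0, 99999999);
    std::vector<uint32_t> v(1000000);
    for (auto& x : v) x = dist(rng);
    std::sort(v.begin(), v.end());

    std::map<uint32_t, double> freq;  // delta -> occurrence count
    for (std::size_t i = 1; i < v.size(); ++i) ++freq[v[i] - v[i - 1]];

    const double n = static_cast<double>(v.size() - 1);
    double entropy = 0.0;  // Shannon entropy of the delta distribution
    for (const auto& kv : freq) {
        const double p = kv.second / n;
        entropy -= p * std::log2(p);
    }
    std::printf("measured delta entropy: %.2f bits/element\n", entropy);
    return 0;
}
```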
Please see the first correct answer or the later answer with arithmetic encoding. Below you may find a fun, but not 100% bullet-proof, solution.
This is quite an interesting task, and here is another solution. I hope somebody finds the result useful (or at least interesting).
Stage 1: Initial data structure, rough compression approach, basic results
Let's do some simple math: we have 1M (1048576 bytes) of RAM initially available to store 10^6 8-digit decimal numbers in the range [0; 99999999]. Storing one number takes 27 bits (assuming unsigned numbers are used), so a raw stream would need 10^6 × 27 bits ≈ 3.4M of RAM. Somebody already said it doesn't seem feasible, but I would say the task can be solved if the input is "good enough". Basically, the idea is to compress the input data with a compression factor of 0.29 or better and do the sorting in a proper manner.
Let's solve the compression issue first. There are some relevant tests already available:
http://www.theeggeadventure.com/wikimedia/index.php/Java_Data_Compression
It looks like LZMA (Lempel–Ziv–Markov chain algorithm) is a good choice to continue with. I've prepared a simple PoC, but there are still some details to be highlighted:
Please note, the attached code is a PoC; it can't be used as a final solution, it just demonstrates the idea of using several smaller buffers to store presorted numbers in some optimal way (possibly compressed). LZMA is not proposed as a final solution; it is used here as the fastest way to introduce compression to this PoC.
See the PoC code below (please note, it is just a demo; to compile it, LZMA-Java will be needed):
With random numbers it produces the following:
For a simple ascending sequence (one bucket is used) it produces:
EDIT
Conclusion:
Stage 2: Enhanced compression, final conclusion
As was already mentioned in the previous section, any suitable compression technique can be used. So let's get rid of LZMA in favor of a simpler and, if possible, better approach. There are a lot of good options, including arithmetic coding, radix trees, etc.
Anyway, a simple but useful encoding scheme will be more illustrative than yet another external library providing some nifty algorithm. The actual solution is pretty straightforward: since there are buckets with partially sorted data, deltas can be used instead of the numbers themselves, as sketched below.
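A minimal sketch of the per-bucket delta step (my illustration; the author's actual implementation is in the links below): sort a bucket, then replace each value with its gap from the previous one before encoding.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Turn a bucket of values into a first value followed by gaps.
std::vector<uint32_t> to_deltas(std::vector<uint32_t> bucket) {
    std::sort(bucket.begin(), bucket.end());
    for (std::size_t i = bucket.size(); i-- > 1; )
        bucket[i] -= bucket[i - 1];  // walk backwards so earlier values stay intact
    return bucket;
}
```

Since all values in a bucket share the bucket's narrow range, the gaps are small and cheap to encode.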
Random input test shows slightly better results:
Sample code
Please note, this approach:
Full code can be found here; BinaryInput and BinaryOutput implementations can be found here.
Final conclusion
No final conclusion :) Sometimes it is a really good idea to move one level up and review the task from a meta-level point of view.
It was fun to spend some time with this task. BTW, there are a lot of interesting answers below. Thank you for your attention and happy coding.
(My original answer was wrong, sorry for the bad math, see below the break.)
How about this?
The first 27 bits store the lowest number you have seen; then each difference to the next number seen is encoded as follows: 5 bits storing the number of bits used for the difference, then the difference itself. Use 00000 to indicate that you saw that number again.
This works because as more numbers are inserted, the average difference between numbers goes down, so you use fewer bits to store each difference as you add more numbers. I believe this is called a delta list.
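Here is a rough sketch of that encoding (my interpretation of the scheme, not the original poster's code): a small MSB-first bit writer emits a 5-bit width, then the delta itself; width 0 stands for a repeated number.

```cpp
#include <cstdint>
#include <vector>

struct BitWriter {
    std::vector<uint8_t> bytes;
    int free_bits = 0;  // unused bits remaining in the last byte
    void put(uint32_t value, int nbits) {  // write nbits of value, MSB first
        for (int i = nbits - 1; i >= 0; --i) {
            if (free_bits == 0) { bytes.push_back(0); free_bits = 8; }
            bytes.back() |= ((value >> i) & 1u) << --free_bits;
        }
    }
};

// Emit one difference: a 5-bit width, then that many bits of the delta.
// A width of 0 (binary 00000) means "saw the same number again".
void write_delta(BitWriter& w, uint32_t delta) {
    int width = 0;
    for (uint32_t d = delta; d != 0; d >>= 1) ++width;
    w.put(static_cast<uint32_t>(width), 5);
    if (width > 0) w.put(delta, width);
}
```

The very first number would be written with w.put(first, 27) before any deltas.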
The worst case I can think of is all numbers evenly spaced (by 100). Assuming 0 is the first number, it costs 27 bits, and each subsequent delta of 100 costs 5 + 7 = 12 bits (00111 for the width, 1100100 for the value); working that through, 27 + 999,999 × 12 ≈ 12 million bits, about 1.5 MB, well over the budget in this case.
Reddit to the rescue!
If all you had to do was sort them, this problem would be easy. It takes 122k (1 million bits) to store which numbers you have seen (0th bit on if 0 was seen, 2300th bit on if 2300 was seen, etc.).
You read the numbers, store them in the bit field, and then shift the bits out while keeping a count.
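A minimal sketch of that easy version (my illustration, with the value range left as a parameter): mark one bit per value seen, then walk the bit field in order to emit the sorted values. As written it discards duplicates, which is the problem addressed next.

```cpp
#include <cstdint>
#include <vector>

// Bit-field sort: one bit per possible value; duplicates collapse.
std::vector<uint32_t> bitfield_sort(const std::vector<uint32_t>& in,
                                    uint32_t range) {
    std::vector<uint8_t> seen((range + 7) / 8, 0);
    for (uint32_t v : in)
        seen[v >> 3] |= static_cast<uint8_t>(1u << (v & 7));
    std::vector<uint32_t> out;
    for (uint32_t v = 0; v < range; ++v)
        if (seen[v >> 3] & (1u << (v & 7)))
            out.push_back(v);
    return out;
}
```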
BUT, you have to remember how many times you have seen each number. I was inspired by the sublist answer above to come up with this scheme:
Instead of using one bit, use either 2 or 27 bits:
I think this works: if there are no duplicates, you have a 244k list. In the worst case you see each number twice (if you see one number three times, it shortens the rest of the list for you); that means 50,000 numbers are seen more than once, and 950,000 items are seen 0 or 1 times.
50,000 * 27 + 950,000 * 2 = 396.7k.
You can make further improvements if you use the following encoding:
0 means you did not see the number, 10 means you saw it once, 11 is how you keep count.
Which will, on average, result in 280.7k of storage.
EDIT: my Sunday morning math was wrong.
The worst case is we see 500,000 numbers twice, so the math becomes:
500,000 * 27 + 500,000 * 2 = 1.77M
The alternate encoding results in an average storage of
500,000 * 27 + 500,000 = 1.70M
: (