可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
Goal
How to encode the data that describes how to re-order a static list from a one order to another order using the minimum amount of data possible?
I have a feeling there is an algorithm or computer science term that will help me but right now I'm too stuck on the problem to figure out other ways of looking at it.
Background Motivation
I'm have a program that is deployed to a remote location where all communication is via an intermittent incredibly expensive satellite connection. It's a slight exaggeration, but data costs are close to a dollar per kilobyte and can only happen a few times per day.
At the start of the day the users are given a list of items, they go out in the field and do stuff but the end result is more or less the same list of items sorted in a different order. There's other data but it's not important to this problem.
Right now I'm sending back a record of all the moves that occur and playing them back in order. As users get comfortable with the system the list of move records is starting to approach the size of just sending back all the items themselves, and often some combination of moves results in undoing previous ones.
Assumptions
- The starting list and ending list is composed of the exact same set of items
- Each item has a unique id (32 bit integer)
- Each item has a unique sort oder (32 bit integer)
- User will have a list of several hundreds to a thousand or more items
- User will generally re-order about 100 of those items in a day
- Changes to the order can be detected moving an item to a new position in the list
- Some "moves" may undo previous ones
- Computation resources for figuring optimal solutions is cheap / unlimited
- Transmission time is expensive
- Sending back the change data is cheaper than sending the back the whole list
Simplest Data Structure
For the purposes of solving this problem assume the following data structures are available.
Here is an example list. The items in each list are the same. Notice that even though only a few of the items have changed, every single item id has a new sort order so you can't just send back new item_id/sort_order_id pairs.
**List 1: Original List** **List 2: Re-ordered List**
order - id order - id
1. 10 1. 90
2. 20 2. 30
3. 30 3. 40
4. 40 4. 50
5. 50 5. 60
6. 60 6. 10
7. 70 7. 80
8. 80 8. 70
9. 90 9. 20
How do I encode the changes required to convert the order of List 1, into the order of List 2 using the minimum amount of data possible?
As a curiosity is it possible to prove that there is a solution is optimal?
Update
A co-worker pointed out that "swap" may not be correct way to think of it. You can also send an item to the top or bottom of the list which is more of a move than a swap. A swap then becomes a combination of two moves.
Thanks for the pointers. So far I don't see a guaranteed optimal solution. Plus the problem just changed a little.
If I can't prove any single method produces the best result then I'll figure out a solution using every method and send back that solution with a small header indicating the method used. Keep suggesting solutions though and I'll update this question with my research.
Thanks everyone!
回答1:
Algo part:
A reordering of a list is called permutation. Each permutation can be split into a set of loops, with each loop of N elements requiring (N - 1) swaps. For example
1, 2, 3, 4, 5, 6 --> 3, 2, 4, 1, 6, 5
This can be split into
1 - 4 - 3 (requires 2 swaps)
2 - 2 (0 swaps)
5 - 6 (1 swap)
To find a solution you can just pick any element at a wrong position and put it on its place.
Details part:
Of course, you can use smaller data types, RLE or some other encoding algorithms and so on.
Very theoretical but non-practical part.
All permutations of a sequence of N numbers can be lexicographically ordered, and one number from 0 to (N! - 1) is enough to represent the sequence. So, theoretically best answer is: compute the index of the permutation, transfer it, recreate the permutation by that index.
回答2:
What you want is the permutation required to sort the list. You can get this by constructing a list of indices from 0 to n, then sorting that list with a custom comparison function that compares the items at the corresponding indices. For example, in Python:
perm = sorted(range(len(l)), key=lambda x:l[x])
You can then send 'perm' over the connection, and use it to get the sorted list:
for x in perm:
print perm[x]
As a further optimization, if most elements remain unchanged, the permutation will be highly compressible - either by using regular compression or by using transforms like difference (eg, store each element as the difference from the previous element, rather than its absolute value), move to front and run length encoding.
回答3:
I am not sure that analyzing the swaps gets you anything; as you say they can undo each other, and lead to confusing results.
I believe that your best option is to identify, in the re-ordered list, the segments of that list that are not-reordered with respect to the original list, even if they start in a new location. In your example, this is the segment from 30 to 60. So in a sort of run length encoding, I would send back a segment map that describes locations and lengths.
Again, using your example data : a list of ordered start index, length :
{ (9, 1) , (3, 4) , (1, 1) , (8, 1) , (7, 1) , (2, 1) }
seems like the smallest amount of info you can send back. The compressability of the data depends on the number and size of segments held in common.
(Edit)
Actually, it occurs to me that there are going to be some data sets where a swap list will be shorter, if the number of swaps is small. But there will probably be some cutover point where run length encoding does better; in that case I would say compute both and pick the smaller one.
回答4:
If you are truly trying to minimize every bit of data going over the wire, how are you transmitting your data? For example, are you compressing it somehow? Using a 32 bit number for sort order is probably overkill if you only have a few thousand items. 16 bits gets you 65000 items for half the $$$. Same goes for the unique ID's.
回答5:
Another possible solution, ignoring your data structure...
Send a set of IDs/indexes for items that have changed (if it's a completely random sparse subset, just list them) and a permutation number describing the re-ordering of that subset. The permutation number will need a big integer representation - size should be proportional to log(n!) where n is the number of items changed.
The permutation number is defined from a permutation array, of course, but this detail can be avoided when decoding. The trick is to encode the permutation number so that, once you have swapped the correct first item into the first slot, you can also derive a new permutation number which is correct for the tail of the array.
That is...
while not empty(indexes)
item-to-swap := permutation-no remainder len(indexes)
permutation-no := permutation-no div len(indexes)
if item-to-swap != 0 : swap slot[indexes[0]], slot[indexes[item-to-swap]]
indexes := tail(indexes)
The != 0 check is needed even though all items needed changing at the start - an item might have been swapped upwards into it's correct location earlier in the loop.
This doesn't attempt to optimise the number of swaps - an item may be swapped upwards several times before being swapped downwards into it's correct location. That said, the permutation number is probably an optimum-space representation for a random permutation of an array. Given that your permutation only affects a small subset of the full array, using the smaller permutation number for that subset makes a lot of sense.
回答6:
A quick fix might be to use a Zobrist hash to spot cases where you go back to a prior order. That is, after each swap, calculate a hash based on the permutation you reach. Each hash maps to the shortest sequence of swaps found so far for that particular permutation.
This can easily be extended with a bit of exploratory searching - the Zobrist hash was invented as a way to optimise game tree searches.
It's easy to give a strict lower bound to the number of swaps, of course - the number of items that are not in their required locations. Whether that lower bound is actually achievable, though, is a more difficult problem.
回答7:
Assuming that:
- You can keep copies of the original and end data on both your field devices and your base system
- When you talk about swaps, you mean two items in the list are swapped with one another
Your best solution is probably:
Rather than keeping a list of all the swaps you do as they are performed, compare your starting and finishing data at the end of the day, and then generate the swaps you would need to make that change. This would ignore any locations in the list that remain unchanged, even if they are only unchanged because a series of swaps "undid" some change. If you have your data take the form of a,b,a,b,...
where a
tells you the index of the next elements to leave in the same order they're in, and b
tells you the index of the item to swap it with.
Because you're only doing swaps instead of shifts, you should very rarely end up with data like your sample data where 30, 40, and 50 are in the same order but in a slightly different location. Since the number of swaps will be between 1/4 and 1/10 the number of original items in the list, you'll usually have a big chunk of your data in both the same order and the same location it was in originally. Let's assume the following swaps were made:
1 <-> 9
4 <-> 2
5 <-> 2
The resulting list would be:
1. 90
2. 50
3. 30
4. 20
5. 40
6. 60
7. 70
8. 80
9. 10
So the change data could be represented as:
1,9,2,4,4,5
That's only six values, which could be represented as 16-bit numbers (assuming you won't have over 16,000 items in your initial list). So each "effective" swap could be represented with a single 32-bit number. And since the number of actual swaps will generally be 1/5 to 1/2 the size of the original list, you'll end up sending between 10% and 20% of the data in your original list over the wire (or less since the number of "effective" swaps may be even less if some of those swaps undo one another).
回答8:
As Peter says, it would be ideal to minimise the size of each integer — but in fact, you can do it without putting restrictions on the number of items. Variable-byte encoding is a way of compressing sequences of integers by only using the necessary number of bytes. The most common way of doing this is to reserve one bit in each byte to indicate whether that byte is the last one in the current list item.
It could be useful to use delta encoding first. That's where you store the differences between the integers, rather than the integers themselves — meaning that they end up compressing better with variable-byte. Of course, the integers being stored (perhaps the IDs of items being changed, in your case) would have to be sorted first, but that doesn't seem like it'd be a problem for you.