Re-ordering an array so it is grouped by identical

2019-05-08 22:40发布

问题:

I'm looking at a problem with processing an array, which can be easily solved with sorting. However, my requirement is actually more relaxed than a full sort: I only need the guarantee that, if there are any identical elements in the array, they will be found adjacent to each other in the array.

Is there an algorithm for re-ordering an array such that it meets the above criteria, which would be more efficient that simply doing a full sort?

回答1:

So the answer is no. This information doesn't really help. It may get you a little better, but not in big O.

To everyone who suggested hashing to get linear time, you can just as well do the same for sorting. This method is called radix/hash sort. It blows up your memory usage.

When there are more restrictions, you can even use cooler tricks (i.e. sum, xor, etc.)

However, for an algorithm that uses comparison only on a generalized array, you're not buying much by reducing the problem this way.

To give a simple intuition for this, suppose you have 1 redundancy for each element, so your array is a1,a1,...an,an (total of 2n elements of n unique numbers).

The size of the solution space is n! (so long as aj-aj are paired, you can permute the pair anyway you want as specified in your problem statement). The size of the input space is (2n)!/(2^(n)).

This means your algorithm needs to produce enough information to arrange ((2n)!/n!)/(2^n) = (n*(n+1)*...2n)/(2^n) elements. Each comparison gives you 1 bit of information. The number of required comparison iterations is log(n)+log(n+1)...+log(2n)-n which is big_omega(nlog(n)). This is not better or worse than sorting.

Here's a semi-rigorous treatment for sorting: http://www.cs.cmu.edu/~avrim/451f11/lectures/lect0913.pdf

I can probably be bribed to generate a similar proof for the current question.



回答2:

If the order is not a problem you can try any of the hashing techniques. All hashing techniques generally lead to grouping of similar items. For each item in the array just use a hash function and all the elements are grouped according to the function that you define.



回答3:

If all elements could be partitioned into two groups, we could solve this problem with a hash.
The complexity is time = O(n) and additional space = O(1).

If all elements could be partitioned into three groups, we could apply the above method twice.
The complexity is time = O(n) * 2 = O(n) and additional space = O(1).

If all elements could be partitioned into four groups, we could apply the first method three times.
The complexity is time = O(n) * 3 = O(n) and additional space = O(1).

If all elements could be partitioned into k groups, we could apply the first method (k-1) times.
The complexity is time = O(n) * (k-1) = O(k*n) and additional space = O(k).

This methodology is superior to sorting in time complexity only when O(k) < O(log n).

Actually, when all elements could be partitioned into three groups, this problem becomes Dutch national flag problem proposed by Edsger Dijkstra.



回答4:

Do you have any constraints on different keys?

If not, you are actually more looking for a hash bag than for any kind of sorting.

Since you did not give any programming language, here is a python example:

from collections import defaultdict

data=[ (1,"a"), (2, "c"), (3, "d"), (2, "b") ]

table = defaultdict(lambda: list())
for key, record in data:
     table[key].append(record)

for key, values in table.iteritems():
    for value in values:
        print key, value

This should run in linear time, as hash table lookups are considered O(1).

If your data is larger than your main memory, a classic external sorting approach may be faster though than hitting an external hash table badly. In general, a full sort can be faster, because the algorithms are well optimized! So do benchmarks!



回答5:

The idea similar to the Python code above, but in CL:

(defun partition (array &key (test #'eql))
  (loop with table = (make-hash-table :test test)
     for i across array do
       (setf (gethash i table) (1+ (gethash i table 0)))
     finally
       (return
         (loop with result = (make-array (length array))
            with pointer = -1
            for key being the hash-keys of table
            for value being the hash-values of table do
              (loop while (> value 0) do
                   (setf (aref result (incf pointer)) key
                         value (1- value)))
            finally (return result)))))

(partition #(1 2 3 "2" "2" "3" 'a 'b 'b 3 3 1 1 "3" 'a))
;; #(1 1 1 2 3 3 3 "2" "2" "3" 'A 'B 'B "3" 'A)

(partition #(1 2 3 "2" "2" "3" 'a 'b 'b 3 3 1 1 "3" 'a) :test #'equal)
;; #(1 1 1 2 3 3 3 "2" "2" "3" "3" 'A 'A 'B 'B)

It also illustrates that the concept of being equal is not a given. You may define what do you treat as equal, and depending on that definition it may or may not make sense to compare it to the speed of sorting (because sorting implies that it is possible to order the sequence).