Is it true that sorting strings is O(n^2 log n)? [duplicate]

Published 2019-06-18 19:39

Question:

This question already has an answer here:

  • String sorting using Merge Sort (3 answers)

I read the following:

Sorting takes O(N log N), so how is it O(N^2 log N)? What we miss here is that the comparison of two strings is not O(1); in the worst case, it takes O(N). So the final complexity is O(N^2 log N).

Is this correct? I had always assumed that sorting takes O(N log N), but now I feel a little thrown off because it has apparently become O(N^2 log N).

If someone could explain why it's O(N^2 log N), that would be great.

Edit: quote taken from here: https://www.hackerrank.com/challenges/string-similarity/topics/suffix-array

Answer 1:

Think of it this way.

When you are sorting numbers, deciding which of two numbers is bigger takes only one comparison.

But when you are sorting strings, deciding which of two strings is bigger can require comparing all the characters of the strings (e.g., comparing hello and hella needs 5 character comparisons).

So every string comparison takes time proportional to the length of the strings. If all the strings have the same length (say l), the complexity becomes O(l * n log n).


Do not confuse n with l. In any time complexity, n stands for the number of inputs. In your case, the complexity is O(n^2 log n) only when the length of each string is also n.

Otherwise, taking the length of the strings as l, the complexity is O(l * n log n).
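The per-comparison cost described above can be made concrete with a small Python sketch; the helper and its counting are illustrative, not part of the answer:

```python
def compare_strings(a, b):
    """Lexicographic comparison that also counts character comparisons.

    Returns (result, count) where result is -1, 0, or 1.
    """
    count = 0
    for ca, cb in zip(a, b):
        count += 1
        if ca != cb:
            return (-1 if ca < cb else 1), count
    # All compared characters match; the shorter string sorts first.
    if len(a) == len(b):
        return 0, count
    return (-1 if len(a) < len(b) else 1), count

# "hello" vs "hella": the first four characters match, so it takes
# five character comparisons before 'o' vs 'a' decides the order.
print(compare_strings("hello", "hella"))  # (1, 5)
```

For strings of length l, the worst case is l character comparisons, which is where the extra factor of l in O(l * n log n) comes from.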



Answer 2:

You extracted that quote out of context; it's worth putting it back into context to understand what's going on.

The context is that we start with some string S of length N, and wish to sort the set of all possible suffixes of that string.

There are N non-empty suffixes, one for each starting position, and the average length of a string in that set is (N+1)/2 (since every length from 1 to N occurs exactly once).
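The suffix set and its average length can be sketched in a few lines of Python (the `suffixes` helper is illustrative):

```python
def suffixes(s):
    """All non-empty suffixes of s, one per starting position."""
    return [s[i:] for i in range(len(s))]

s = "banana"          # N = 6
sufs = suffixes(s)
avg = sum(len(x) for x in sufs) / len(sufs)
# Lengths are 6, 5, 4, 3, 2, 1, so the average is (N+1)/2 = 3.5.
print(len(sufs), avg)  # 6 3.5
```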

If we assume that the expected cost of comparing two strings of average length L is O(L), and that a comparison sort of N objects performs O(N log N) comparisons, then since L is (N+1)/2, the expected time to comparison-sort this particular set of strings is O((N+1)/2 × N log N), which simplifies to O(N^2 log N).

But the expected cost of comparing two strings of length L is not actually O(L), because it is generally the case that it is not necessary to look at every character position in two strings in order to compare them. It is sufficient to look at the characters until the first non-match is found. If the strings are randomly distributed in the universe of possible strings, then the expected number of characters which need to be examined is actually O(1) (by a simple probabilistic argument), although the worst-case cost is O(L).
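The O(1) expected cost for random strings can be checked with a small simulation (the names and sizes here are arbitrary, chosen just for illustration):

```python
import random
import string

def chars_examined(a, b):
    """Characters inspected before lexicographic order is decided."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i + 1
    return min(len(a), len(b))

random.seed(0)
L = 1000  # string length: the worst case is L comparisons
trials = [
    chars_examined(
        ''.join(random.choices(string.ascii_lowercase, k=L)),
        ''.join(random.choices(string.ascii_lowercase, k=L)))
    for _ in range(10_000)
]
avg = sum(trials) / len(trials)
# Over a 26-letter alphabet the expected value is 26/25 ≈ 1.04,
# independent of L: the first mismatch almost always comes early.
print(avg)
```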

In the case of quicksort (at least), there is a counter-balance, which is the tendency for comparisons in later phases of the quicksort to be between strings which are closer together. In this case, the number of characters needed to be examined tends to approach log N, which is the expected length of the longest common prefix between two consecutive strings in a sorted uniformly-distributed sample of N strings.
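The expected longest-common-prefix length of neighbouring strings in a sorted uniform sample can likewise be estimated by simulation (a rough sketch with arbitrary sizes):

```python
import random
import string

def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

random.seed(1)
N, L = 4096, 20
sample = sorted(
    ''.join(random.choices(string.ascii_lowercase, k=L))
    for _ in range(N)
)
avg_lcp = sum(lcp(a, b) for a, b in zip(sample, sample[1:])) / (N - 1)
# For N uniform strings over a 26-letter alphabet, the typical LCP of
# sorted neighbours is on the order of log_26(N), far below L.
print(avg_lcp)
```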

So the actual expected cost of the sort should be O(N log^2 N).

As a further refinement, it is possible to modify the quicksort algorithm to track the LCPs of each partition, which can reduce the expected number of character comparisons per string comparison to something smaller than log N.

(A similar argument can be applied to radix sort, which also eliminates the apparent O(L) term from the expected -- but not worst-case -- time. The cited article proceeds to provide a much better algorithm for the suffix tree problem, which avoids comparison-sorting.)