WordCount: how inefficient is McIlroy's solution?

Published 2020-02-19 08:36

Question:

Long story short: in 1986 an interviewer asked Donald Knuth to write a program that takes a text and a number N as input, and lists the N most-used words sorted by their frequencies. Knuth produced a 10-page Pascal program, to which Douglas McIlroy replied with the following six-line shell script:

tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q

Read the full story at http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/ .

Of course they had very different goals: Knuth was showcasing his concept of literate programming and built everything from scratch, while McIlroy composed a few common UNIX utilities to achieve the shortest source code.

My question is: how bad is that?
(Purely from a runtime-speed point of view, since I'm pretty sure we all agree that 6 lines of code is easier to understand/maintain than 10 pages, literate programming or not.)

I can understand that sort -rn | sed ${1}q may not be the most efficient way to extract the common words, but what's wrong with tr -cs A-Za-z '\n' | tr A-Z a-z? It looks pretty good to me. About sort | uniq -c, is that a terribly slow way to determine the frequencies?
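To make the comparison concrete, here is a small Python sketch (my own illustration, not from the question or answers; the sample text is made up) contrasting the sort-based counting of sort | uniq -c with hash-based counting:

```python
import re
from collections import Counter

text = "The quick brown fox jumps over the lazy dog the fox"

# Stages 1-2 of the pipeline: split on non-letters and lowercase.
# Like `tr -cs A-Za-z '\n' | tr A-Z a-z`, this is a single linear pass.
words = re.findall(r"[a-z]+", text.lower())

# `sort | uniq -c`: sort all T words (O(T log T)), then count adjacent runs.
runs = []
for w in sorted(words):
    if runs and runs[-1][1] == w:
        runs[-1][0] += 1
    else:
        runs.append([1, w])

# Hash-based counting does the same job in one O(T) pass.
counts = Counter(words)
assert {w: n for n, w in runs} == dict(counts)
print(counts.most_common(2))  # -> [('the', 3), ('fox', 2)]
```

The two approaches produce identical counts; the difference is only that the sort-based one pays O(T log T) for the ordering that makes duplicates adjacent.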

A few considerations:

  • tr should be linear time (?)
  • sort I'm not sure of, but I'm assuming it's not that bad
  • uniq should be linear time too
  • spawning processes should be linear time (in the number of processes)

Answer 1:

The Unix pipeline consists of a few linear passes and two sorts, so its overall running time is O(n log n).

Knuth's algorithm only needs to extract the top N; see http://en.wikipedia.org/wiki/Selection_algorithm for the available time/space trade-offs. In theory, selection can beat a full sort on typical inputs with a large number of distinct words.

So Knuth's program could be faster, particularly because the English vocabulary has a limited size: the log(n) factor effectively becomes a large constant, though possibly at the cost of a lot of memory.
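As a sketch of the selection idea (not Knuth's actual code; the counts below are made up), Python's heapq.nlargest keeps a heap of only N entries while scanning all Q counts once:

```python
import heapq
from collections import Counter

# Hypothetical word frequencies for Q = 6 distinct words.
freq = Counter({"the": 120, "of": 80, "and": 75, "fox": 3, "dog": 2, "quick": 1})

N = 3
# O(Q log N): scan the counts once, keeping only the N best in a heap,
# instead of sorting all Q entries (O(Q log Q)) the way `sort -rn` does.
top = heapq.nlargest(N, freq.items(), key=lambda kv: kv[1])
print(top)  # -> [('the', 120), ('of', 80), ('and', 75)]
```

For a small constant N this makes the selection step effectively linear in Q.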

But maybe this question is better suited for https://cstheory.stackexchange.com/



Answer 2:

Doug McIlroy's solution has time complexity O(T log T), where T is the total number of words. This is due to the first sort.

For comparison, here are four faster solutions to the same problem:

Here is a C++ implementation with an upper-bound time complexity of O((T + N) log N), though in practice it is nearly linear, close to O(T + N log N).

Below is a fast Python implementation. Internally, it uses a hash dictionary (Counter) and a heap, with time complexity O(T + N log Q), where Q is the number of unique words:

import collections, re, sys

filename = sys.argv[1]
k = int(sys.argv[2]) if len(sys.argv) > 2 else 10
reg = re.compile('[a-z]+')

# Count all words in one linear pass: O(T).
counts = collections.Counter()
for line in open(filename):
    counts.update(reg.findall(line.lower()))

# most_common(k) uses a heap internally: O(Q log k).
for word, count in counts.most_common(k):
    print(word, count)

And another Unix shell solution using AWK. It has time complexity O(T + Q log Q):

awk -v FS="[^a-zA-Z]+" '
{
    # Count every word in one linear pass: O(T).
    for (i = 1; i <= NF; i++)
        freq[tolower($i)]++;
}
END {
    # Emit Q "count word" lines for the final sort: O(Q log Q).
    for (word in freq)
        print(freq[word] " " word)
}
' | sort -rn | head -10

Here is an extremely fast solution in Rust by Anders Kaseorg.

CPU time comparison (in seconds):

                                     bible32       bible256       Asymptotic
Rust (prefix tree)                   0.632         5.284          O(?)
C++ (prefix tree + heap)             4.838         38.587         O((T + N) log N)
Python (Counter)                     14.366        115.855        O(T + N log Q)
AWK + sort                           21.548        176.411        O(T + Q log Q)
McIlroy (tr + sort + uniq)           60.531        690.906        O(T log T)

Notes:

  • T >= Q, typically Q >> N (N is a small constant)
  • bible32 is the Bible concatenated 32 times (135 MB); bible256 is it concatenated 256 times (1.1 GB)
  • AWK timing might greatly vary depending on which version of AWK you are using (gawk, nawk or mawk)

As you can see, McIlroy's solution runs roughly 100 times slower than the fastest known program! However, his solution is still very elegant, easy to debug and, after all, it is not that terrible in performance either, unless you start using it for gigabyte files. Bad implementations of more complex algorithms in C/C++ or Haskell could easily run much slower than his pipeline (I've witnessed that).