Confusion regarding PATRICIA [closed]

2019-04-17 18:06发布

问题:

According to points 3 and 4 of libstdc++ documentation, PATRICIA tries have two types of nodes:

A (PATRICIA) trie is similar to a tree, but with the following differences:

  1. It explicitly views keys as a sequence of elements. E.g., a trie can view a string as a sequence of characters; a trie can view a number as a sequence of bits.

  2. It is not (necessarily) binary. Each node has fan-out n + 1, where n is the number of distinct elements.

  3. It stores values only at leaf nodes.

  4. Internal nodes have the properties that A) each has at least two children, and B) each shares the same prefix with any of its descendant.

The book I've been reading (Algorithms in C, Parts 1-4 by Robert Sedgewick) seems to describe a PATRICIA trie storing n values with only n nodes, using internal nodes to store values:

Like DSTs, patricia tries allow search for N keys in a tree with just N nodes. ... we avoid external nodes via another simple device: we store data in internal nodes and replace links to external nodes with links that point back upwards to the correct internal node in the trie

It seems there are two camps of belief here:

  1. On the one hand we have a strict, specific definition (i.e. Sedgewick, Knuth, Morrison who all seem to describe PATRICIA exclusively as a prefix-compressed binary tree with one-way branching eliminated); and
  2. Then we have those believing the term forms a loose, vague definition which seems more like they meant to use a word like "map", "dictionary" or "trie" (which are all actually loosely defined, i.e. the libcstd++ documentation).

I guess I'm concerned about the accuracy of my resources. As I understand, due to problems introduced by common prefixes, it isn't possible to represent a tree with just N nodes without presenting it as a binary tree (which seems to violate point 2 of libcstd++ docs, and point 4 when dealing with variable-width keys), and without losing the notion of strict one-way branching (violation of points 3 and 4 by rendering the concept of "leaf nodes" and "children" somewhat invalid). The two features work in tandem to eliminate the dilemma that is "internal nodes" that would cause such trees to use more than N nodes (recall: N items with N just nodes).

These two groups of references can't both be correct; there's too much mutual exclusion. Where one reference says PATRICIA is binary and another says it might not be, they can't both be considered factually correct, and that's just one example of inconsistency I see here. Which of these references are correct?

回答1:

I continued to search for a specific definition from past reputable sources to confirm what I had suspected, and I'm writing to provide my findings. Perhaps the most significant is the official paper defining PATRICIA, published by DR Morrison in October 1968s "Journal of the ACM":

PATRICIA evolved from "A Library Automaton" [3] and other studies. ... Early in this evolution it was decided that the alphabet should be restricted to a binary one. A theorem which strongly influenced this decision is one which, in another form, is due to Euler. The theorem states that if the alphabet is binary, then the number of branches is exactly one less than the number of ends. Corollaries state that as the library grows, each new end brings into the library with it exactly one new branch, and each branch has exactly two exits. These facts are very useful in the allocation of storage for the index. They imply that the total storage required is completely determined by the number of ends, and all of the storage required will actually be used.

This certainly contradicts points 2 and 3 of the libstdc++ reference. There's further evidence in this paper, such as specific algorithm details, but the quote above should suffice.

There don't appear to be any deviations from the official description in the Sedgewick quote, however. Based on that, the libstdc++ resource is certainly less valid than the Sedgewick resource.



回答2:

Although both definitions seem to be correct, first one is more detailed and seems better to me. Also have a look at this answer, where I try to depict the difference between a Patricia and regular Trie.