How to return a node, uniformly at random, from a

2019-05-31 06:46发布

问题:

Given a BST (may or may not be balanced) how can one return "any" node uniformly at random? A constraint is you cannot use an external indexing data structure. You must traverse the tree in such a manner that every node has an equal chance of being visited.

This question has me perplexed for quite a while. If we can indeed use an external hashtable/pointers we could just randomize on those and return the corresponding node. However, my colleague has put forth a rather complex variant of the question, where no additional data structures can be used.

  • A simple random walk with a 50-50 chance of either going L/R doesn't work since the probability of returning a node near the bottom of the tree is much smaller(probabilities compound)
  • Even randomly generating a depth d and traversing at most d nodes to return the node (or stopping if it's a leaf) doesn't generate a uniform distribution.

Update: You cannot have an inorder walk and store the result in an array either.

How can one achieve such a traversal?

回答1:

Walk the tree in any order, keeping the following values:

  • N: the number of nodes seen

  • selected: the currently selected node.

Initially, N is 0 and selected is None. Visiting a node consists of the following:

  1. Increment N

  2. Generate a random integer in the range [0, N).

  3. If the random integer selected is 0, set selected to the current Node.

Note that the values N and selected need to be modified during the walk. That means that they are both input and output values to the visitor function.

At the end of the walk, N will be the number of nodes in the tree, and selected will be a random node selected with uniform probability (assuming you have a good random number generator).

This algorithm is not restricted to BSTs. It will work on any tree of any shape. In particular, it will work on a simple linear sequence of objects of unknown length, corresponding to the well-known random selection algorithm which is to iterate over the objects, replacing the selected random object with the newly visited one with probability 1/N where N is the number of objects seen to date.

If you keep track of visited nodes, it will also work on any connected graph.

If you have a very large tree (or graph), perhaps spread over a number of servers and/or storage devices, you can use a different presentation of this algorithm, which provides for a certain level of parallelism (and also prevents the need to keep a global walk structure or pass values into the walk).

We assume that each node-server has direct access to k objects and indirect access to some known number of children servers. The algorithm allows for redundant children, but assumes that network communication is (almost) perfect; dealing with network splits is outside of the scope of this answer. We also assume that every query has an associated unique query number, which allows us to deal with some networking artifacts. The query has no other information (other than the server to respond to), and is expected to return a tuple consisting of a count and a randomly-selected node.

When a node-server receives a query with id q, it does the following:

  1. If it has previously responded to query q, immediately return <0, null>

  2. Set count to k and selected to a randomly-selected object from the k objects it has direct access to.

  3. For each child server, send the query (with the same query-id)

  4. For each response returned (it doesn't matter what order the responses come in):

    a. Add response.count to count

    b. With probability response.count / count, replace selected with response.selected

  5. When all children servers have responded, return <count, selected>



回答2:

If you know the number of nodes n in the tree, find the kth node in an inorder walk of the tree for some randomly-chosen k between 0 and n-1, inclusive. If you don't, you can walk the tree and figure out its size and then do the above.

If the nodes of the tree are stored contiguously in some array, randomly sample an array element and return it.

If each tree node can tell you the size of the subtree rooted there, figure out how many nodes are in the tree, generate a random k up to the number of nodes in the tree, and select the kth element of the tree.

In all cases above, the "tree" part is a red herring.

There is a random walk on any nontrivial connected graph whose stationary distribution is uniform. Select a neighbour of the current node. If it has lower or equal degree, go there. If it has higher degree, go there with probability cur_deg / that_deg and stay put otherwise. Sampling from this random walk is called in various contexts "Gibbs sampling" and "Metropolis-Hastings."



回答3:

If you have a full (bottom level is completely full) binary search tree then it is possible to quickly (sub-linear time) do as you ask, because you know the structure of the tree. If you want me to post an answer that gives the solution for this case, let me know and I'll update my answer. However, if you just have a generic binary search tree of arbitrary shape, then it is impossible to uniformly sample a node without visiting the entire tree so that you know the shape.