Probability of finding the median with finite spac

2019-02-06 18:49发布

问题:

This is a spin off of this StackOverflow question.

Assume that you have a fixed number k of storage locations, and space for two counters. You will receive n items in random order (all permutations of the n items are equally likely). After receiving each item you can either store it in one of the k locations (discarding one of the previously stored values), or discard the item. You can also increment or decrement either of the counters. Any discarded item cannot be retrieved.

The questions are

  1. What is the strategy that maximizes your probability of finding the exact median?
  2. What is that probability?

Obviously, if k > n/2, we can find the median. In general it seems that the same strategy of attempting to keep the number of discarded high values equal to the number of discarded low values should be optimal, but I am not sure how to prove it, nor how to figure out the probability that it finds the median.

Also of interest is the case where we do not know n but know the probability distribution of n.

Edit: Assume for now that the values are distinct (no duplicates.) However, if you can solve the non distinct case as well, then that would be impressive.

回答1:

Munro and Paterson studied essentially this problem in their paper Selection and sorting with limited storage. They show that your algorithm requires k = Ω(√n) to succeed with constant probability and that this is asymptotically optimal by appealing to basic results about one-dimensional random walks.

If I wanted to prove absolute optimality, the first thing I would try would be to consider an arbitrary algorithm A and then couple its execution with an algorithm A' that, the first time A deviates from your algorithm, does your algorithm would do instead and then attempts to follow A as closely as it can.



回答2:

A wild guess: discard the element that is farthest from the mean of the currently stored values.

Comparing to the current median doesn't work if the distribution of values is multi-modal and we get values from a non-dominant mode first.