Iterative or Lazy Reservoir Sampling

2020-07-09 07:46发布

问题:

I'm fairly well acquainted with using Reservoir Sampling to sample from a set of undetermined length in a single pass over the data. One limitation of this approach, in my mind, is that it still requires a pass over the entire data set before any results can be returned. Conceptually this makes sense, since one has to allow items in the entirety of the sequence the opportunity to replace previously encountered items to achieve a uniform sample.

Is there a way to be able to yield some random results before the entire sequence has been evaluated? I'm thinking of the kind of lazy approach that would fit well with python's great itertools library. Perhaps this could be done within some given error tolerance? I'd appreciate any sort of feedback on this idea!

Just to clarify the question a bit, this diagram sums up my understanding of the in-memory vs. streaming tradeoffs of different sampling techniques. What I want is something that falls into the category of Stream Sampling, where we don't know the length of the population beforehand.

Clearly there is a seeming contradiction in not knowing the length a priori and still getting a uniform sample, since we will most likely bias the sample towards the beginning of the population. Is there a way to quantify this bias? Are there tradeoffs to be made? Does anybody have a clever algorithm to solve this problem?

回答1:

If you know in advance the total number of items that will be yielded by an iterable population, it is possible to yield the items of a sample of population as you come to them (not only after reaching the end). If you don't know the population size ahead of time, this is impossible (as the probability of any item being in the sample can't be be calculated).

Here's a quick generator that does this:

def sample_given_size(population, population_size, sample_size):
    for item in population:
        if random.random() < sample_size / population_size:
            yield item
            sample_size -= 1
        population_size -= 1

Note that the generator yields items in the order they appear in the population (not in random order, like random.sample or most reservoir sampling codes), so a slice of the sample will not be a random subsample!



回答2:

If population size is known before hand, can't you just generate sample_size random "indices" (in the stream) and use that to do a lazy yield? You won't have to read the entire stream.

For instance, if population_size was 100, and sample_size was 3, you generate a random set of integers from 1 to 100, say you get 10, 67 and 72.

Now you yield the 10th, 62nd and 72nd elements of the stream and ignore the rest.

I guess I don't understand the question.