Generate random numbers distributed by Zipf

2019-03-12 10:54发布

问题:

The Zipf probability distribution is often used to model file size distribution or item access distributions on items in P2P systems. e.g. "Web Caching and Zip like Distribution Evidence and Implications", but neither Boost or the GSL (Gnu Scientific Library) provide an implementation to generate random numbers using this distribution. I have not found a (trustworthy) implementation using the common search engines.

How can random numbers that are distributed according to the Zipf distribution by using a U(0,1) random generator, e.g. the Mersenne twister?

回答1:

zipfR is a free and open source library implemented with R. VGAM is another R package that also implements Zipf.

It's also worth noting that the Gnu Scientific Library has an implementation of the Pareto distribution which is effectively the continuous analogue of the discrete Zipf distribution.

Also, the Zeta distribution is equivalent to Zipf for infinite N. The GSL has an implementation of the Riemann zeta function, so you could use that to construct the distribution yourself.



回答2:

numpy.random.zipf generates Zipf samples using python.



回答3:

Here's a Python Zipf-like distribution generator for n items with parameter alpha >= 0:

import random 
import bisect 
import math 

class ZipfGenerator: 

    def __init__(self, n, alpha): 
        # Calculate Zeta values from 1 to n: 
        tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)] 
        zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0]) 

        # Store the translation map: 
        self.distMap = [x / zeta[-1] for x in zeta] 

    def next(self): 
        # Take a uniform 0-1 pseudo-random value: 
        u = random.random()  

        # Translate the Zipf variable: 
        return bisect.bisect(self.distMap, u) - 1


回答4:

A very efficient algorithm to generate Zipf distributed random variates was recently developed for the next versions (>= 3.6) of the Apache Commons Math library (see code here). It makes use of rejection-inversion sampling and also works for exponents less than 1. It does not require precalculating the CDF and keeping it in memory. Furthermore, the costs for generating one sample are constant and do not increase with the number of items.



回答5:

We were discussing the answer of @stanga in this thread. There are some nice optimizations suggested for his algorithm.