Python Custom Zipf Number Generator Performing Poorly

Question:

I needed a custom Zipf-like number generator because the numpy.random.zipf function doesn't do what I need. Firstly, its alpha must be greater than 1.0, and I need an alpha of 0.5. Secondly, its cardinality is tied directly to the sample size, and I need to draw more samples than there are unique values, e.g. make a list of 1000 elements from a Zipfian distribution over only 6 unique values.

@stanga posted a great solution to this.

import random 
import bisect 
import math 

class ZipfGenerator: 

    def __init__(self, n, alpha): 
        # Calculate Zeta values from 1 to n: 
        tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)] 
        zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0]) 

        # Store the translation map: 
        self.distMap = [x / zeta[-1] for x in zeta] 

    def next(self): 
        # Take a uniform 0-1 pseudo-random value: 
        u = random.random()  

        # Translate the Zipf variable: 
        return bisect.bisect(self.distMap, u) - 1
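
For reference, here is a minimal usage sketch for my target case (the variable names are just illustrative): drawing 1000 samples over a cardinality of 6 with alpha = 0.5.

gen = ZipfGenerator(6, 0.5)
# Each call to next() returns a Zipf-distributed integer in 0..5:
samples = [gen.next() for _ in range(1000)]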

The alpha can be less than 1.0, and any number of samples can be drawn for a fixed cardinality n. The problem is that it runs far too slowly.

# Calculate Zeta values from 1 to n: 
tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)] 
zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0])

These two lines are the culprits. With n=50000 I can generate my list in ~10 seconds, but I need to run this with n=5000000, and that's not feasible. I don't fully understand why it is so slow, because (I think) it has linear complexity and the floating-point operations seem simple. I am using Python 2.6.6 on a good server.

Is there an optimization I can make, or a different solution altogether that meets my requirements?

EDIT: I'm updating my question with a possible solution using the modifications recommended by @ev-br. I've simplified it into a subroutine that returns the entire list. @ev-br was right to suggest replacing bisect with numpy.searchsorted, as the former proved to be a bottleneck as well.

import numpy

def randZipf(n, alpha, numSamples): 
    # Calculate Zeta values from 1 to n: 
    tmp = numpy.power( numpy.arange(1, n+1), -alpha )
    zeta = numpy.r_[0.0, numpy.cumsum(tmp)]
    # Store the translation map: 
    distMap = [x / zeta[-1] for x in zeta]
    # Generate an array of uniform 0-1 pseudo-random values: 
    u = numpy.random.random(numSamples)
    # bisect them with distMap
    v = numpy.searchsorted(distMap, u)
    samples = [t-1 for t in v]
    return samples
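
Called for my target case it looks like this; the bincount line is only an illustrative sanity check of the resulting frequencies:

samples = randZipf(6, 0.5, 1000)
# Counts per value should decay roughly like (k+1)**-0.5:
print numpy.bincount(samples)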

Answer 1:

Let me take a small example first

In [1]: import numpy as np

In [2]: import math

In [3]: alpha = 0.1

In [4]: n = 5

In [5]: tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)]

In [6]: zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0])

In [7]: tmp
Out[7]: 
[1.0,
 0.9330329915368074,
 0.8959584598407623,
 0.8705505632961241,
 0.8513399225207846]

In [8]: zeta
Out[8]: 
[0,
 1.0,
 1.9330329915368074,
 2.82899145137757,
 3.699542014673694,
 4.550881937194479]

Now, let's try to vectorize it, starting from the innermost operations. The reduce call is essentially a cumulative sum, and it is also the real culprit: sums + [sums[-1] + x] builds a brand-new list on every step, so constructing zeta this way is quadratic in n, not linear. NumPy computes the same cumulative sum directly:

In [9]: np.cumsum(tmp)
Out[9]: array([ 1.        ,  1.93303299,  2.82899145,  3.69954201,  4.55088194])

You want a leading zero, so let's prepend it:

In [11]: np.r_[0., np.cumsum(tmp)]
Out[11]: 
array([ 0.        ,  1.        ,  1.93303299,  2.82899145,  3.69954201,
        4.55088194])

Your tmp array can be constructed in one go as well:

In [12]: tmp_vec = np.power(np.arange(1, n+1), -alpha)

In [13]: tmp_vec
Out[13]: array([ 1.        ,  0.93303299,  0.89595846,  0.87055056,  0.85133992])

Now, for some quick-and-dirty timings:

In [14]: %%timeit 
   ....: n = 1000
   ....: tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)]
   ....: zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0])
   ....: 
100 loops, best of 3: 3.16 ms per loop

In [15]: %%timeit
   ....: n = 1000
   ....: tmp_vec = np.power(np.arange(1, n+1), -alpha)
   ....: zeta_vec = np.r_[0., np.cumsum(tmp_vec)]
   ....: 
10000 loops, best of 3: 101 µs per loop

And the gap only widens as n grows:

In [18]: %%timeit
   ....: n = 50000
   ....: tmp_vec = np.power(np.arange(1, n+1), -alpha)
   ....: zeta_vec = np.r_[0., np.cumsum(tmp_vec)]
   ....: 
100 loops, best of 3: 3.26 ms per loop

As compared to

In [19]: %%timeit
   ....: n = 50000
   ....: tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)]
   ....: zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0])
   ....: 
1 loops, best of 3: 7.01 s per loop

Down the line, the call to bisect can be replaced by np.searchsorted.
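
For instance, a sketch of that last step, reusing zeta_vec from the snippets above (num_samples is just an illustrative name):

num_samples = 1000                            # however many draws you need
dist_map = zeta_vec / zeta_vec[-1]            # normalize the cumulative sums into a CDF
u = np.random.random(num_samples)             # uniform 0-1 draws
samples = np.searchsorted(dist_map, u) - 1    # one vectorized lookup instead of a bisect per sample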

EDIT: A couple of comments which are not directly relevant to the original question, but are based on my guesses about what might trip you up down the line:

  • a random generator should accept a seed. You can rely on numpy's global np.random.seed, but it's better to make it an explicit argument defaulting to None (meaning: do not seed).
  • samples = [t-1 for t in v] is not needed, just return v-1.
  • it's best to avoid mixing camelCase and pep8_lower_case_with_underscores.
  • note that this is very similar to what scipy.stats.rv_discrete does. If you only need sampling, you're fine; if you need a full-fledged distribution, you may look into using it (see the sketch after this list).
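
Putting these comments together, here is a hedged sketch of what the cleaned-up sampler could look like (the function and argument names are mine, not from the question):

import numpy as np

def rand_zipf(n, alpha, num_samples, seed=None):
    # Explicit, optional seeding; None leaves the global RNG state alone.
    if seed is not None:
        np.random.seed(seed)
    # Normalized cumulative weights k**-alpha with a leading zero, as above.
    tmp = np.power(np.arange(1, n+1), -alpha)
    zeta = np.r_[0.0, np.cumsum(tmp)]
    dist_map = zeta / zeta[-1]
    # Vectorized inverse-CDF sampling; values land in 0..n-1.
    u = np.random.random(num_samples)
    return np.searchsorted(dist_map, u) - 1

samples = rand_zipf(6, 0.5, 1000, seed=42)

And, for comparison, the scipy.stats.rv_discrete route, sketched under the same assumptions:

from scipy import stats

xk = np.arange(6)                     # the 6 unique values
pk = np.power(np.arange(1, 7), -0.5)  # unnormalized Zipf weights, alpha = 0.5
pk /= pk.sum()                        # rv_discrete wants probabilities summing to 1
zipf_like = stats.rv_discrete(name='zipf_like', values=(xk, pk))
samples = zipf_like.rvs(size=1000)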