I'm currently benchmarking some data structures in C++ and I want to test them on Zipf-distributed numbers.
I'm using the generator provided on this site: http://www.cse.usf.edu/~christen/tools/toolpage.html
I adapted the implementation to use a Mersenne Twister generator.
It works well, but it is really slow. In my case, the range can be big (about a million) and the number of random numbers generated can be several million.
The alpha parameter does not change over time; it is fixed.
I tried to precalculate all the `sum_prob` values. It's much faster, but still slow on big ranges; a sketch of what I'm doing is below.
Is there a faster way to generate Zipf-distributed numbers? Even something less precise would be welcome.
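Roughly, the per-draw work looks like this (a sketch of the logic I adapted, not my exact code):

```cpp
#include <cmath>
#include <random>
#include <vector>

// Sketch: even with the cumulative probabilities (sum_prob) precalculated
// into a vector, each draw still scans the table linearly, so generating
// one number is O(n) in the range size.
int zipf_draw(const std::vector<double>& sum_prob, std::mt19937& rng) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    const double z = uniform(rng);
    for (std::size_t i = 1; i < sum_prob.size(); ++i)
        if (sum_prob[i] >= z)              // first bin whose CDF reaches z
            return static_cast<int>(i);
    return static_cast<int>(sum_prob.size()) - 1;  // guard against rounding
}
```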
Thanks
The only C++11 Zipf random generator I could find calculated the probabilities explicitly and used `std::discrete_distribution`. This works fine for small ranges, but is not useful if you need to generate Zipf values with a very wide range (for database testing, in my case), since it will exhaust memory. So, I implemented the below-mentioned algorithm in C++. I have not rigorously tested this code, and some optimizations are probably possible, but it only requires constant space and seems to work well.
In the meantime, there is a faster way based on rejection-inversion sampling; see the code here.
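For reference, the rejection-inversion idea is roughly the following (a simplified sketch after Hörmann & Derflinger, not the linked implementation; it assumes the exponent `s != 1`, which the full version handles with a log-based form of `H`):

```cpp
#include <cmath>
#include <random>

// Sketch of rejection-inversion sampling for Zipf over {1..n}: invert
// the antiderivative H of the continuous density x^-s, round to the
// nearest bin, and reject the overshoot.  Constant space, O(1) expected
// time per draw.  Assumes s != 1.
class zipf_rejection_inversion {
public:
    zipf_rejection_inversion(long n, double s)
        : n_(n), s_(s),
          u_min_(H(1.5) - 1.0),   // chosen so that k == 1 is always accepted
          u_max_(H(n + 0.5)),
          dist_(u_min_, u_max_) {}

    template <class Rng>
    long operator()(Rng& rng) {
        for (;;) {
            const double u = dist_(rng);
            const double x = H_inv(u);
            long k = std::lround(x);
            if (k < 1) k = 1;
            if (k > n_) k = n_;
            // Accept k with unnormalized probability k^-s.
            if (u >= H(k + 0.5) - std::pow(double(k), -s_))
                return k;
        }
    }

private:
    // H(x) = (x^(1-s) - 1) / (1 - s), an antiderivative of x^-s (s != 1).
    double H(double x) const { return (std::pow(x, 1.0 - s_) - 1.0) / (1.0 - s_); }
    double H_inv(double y) const { return std::pow(1.0 + (1.0 - s_) * y, 1.0 / (1.0 - s_)); }

    long n_;
    double s_;
    double u_min_, u_max_;
    std::uniform_real_distribution<double> dist_;
};
```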
As a complement to the very nice rejection-inversion implementation given above, here's a C++ class, with the same API, that is simpler and faster for a small number of bins only. On my machine, it's about 2.3x faster for N = 300. It's faster because it performs a direct table lookup instead of computing logs and powers. The table eats cache, though... Making a guess based on the size of my CPU's d-cache, I imagine that the proper rejection-inversion algorithm given above will become faster somewhere around N = 35K, maybe. Also, initializing the table requires a call to `std::pow()` for each bin, so this wins on performance only if you are drawing more than N values out of it. Otherwise, rejection-inversion is faster. Choose wisely. (I've set up the API so it looks a lot like what the std::c++ standards committee might come up with.)
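The class itself is not shown here, but the table idea amounts to something like this sketch, which leans on `std::discrete_distribution` for the lookup (the hand-rolled class described above should be faster still, since `std::discrete_distribution` typically searches a cumulative table rather than indexing directly):

```cpp
#include <cmath>
#include <random>
#include <vector>

// Sketch of the table-based approach for small N: pay one std::pow()
// per bin up front, then draw from the fixed weight table.  O(N) memory,
// so only sensible for small ranges.
class zipf_table {
public:
    zipf_table(int n, double s) {
        std::vector<double> weights(n);
        for (int k = 1; k <= n; ++k)
            weights[k - 1] = std::pow(double(k), -s);  // one pow() per bin
        dist_ = std::discrete_distribution<int>(weights.begin(), weights.end());
    }

    template <class Rng>
    int operator()(Rng& rng) { return dist_(rng) + 1; }  // bins are 1-based

private:
    std::discrete_distribution<int> dist_;
};
```

Note that `std::discrete_distribution` normalizes the weights itself, so no explicit normalization constant is needed.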
The call to `pow((double) i, alpha)` in your code is executed `n` times for each call to `zipf()`. It is regrettable that it is necessary to call the `pow()` function because, internally, this function sums not one but two Taylor series [considering that `pow(x, alpha) == exp(alpha*log(x))`]. If `alpha` is an integer, of course, then you can speed the code up a lot by replacing `pow()` with simple multiplication. If `alpha` is a rational number, then you may be able to speed the code up to a lesser degree by coding a Newton-Raphson iteration to take the place of the two Taylor series. If the last condition holds, please advise.
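For instance, if `alpha` happened to be a small non-negative integer (only an assumption; the question says merely that it is fixed), the replacement could look like this:

```cpp
// Hypothetical replacement for pow(x, alpha) when alpha is a non-negative
// integer: exponentiation by squaring needs O(log alpha) multiplications
// and no log/exp evaluation at all.
inline double ipow(double x, unsigned alpha) {
    double r = 1.0;
    while (alpha) {
        if (alpha & 1) r *= x;  // fold in the current bit's contribution
        x *= x;
        alpha >>= 1;
    }
    return r;
}
```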
Fortunately, you have indicated that `alpha` does not change. Can you not speed the code up a lot by preparing a table of `pow((double) i, alpha)`, then letting `zipf()` look numbers up in the table? That way, `zipf()` would not have to call `pow()` at all. I suspect that this would save significant time.

Yet further improvements are possible. What if you factored a function `sumprob()` out of `zipf()`? Could you not prepare an even more aggressive look-up table for `sumprob()`'s use?

Maybe some of these ideas will move you in the right direction. See what you cannot do with them.
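To make the table suggestion concrete, here is a minimal sketch (the helper name `make_pow_table` is illustrative, not part of genzipf.c):

```cpp
#include <cmath>
#include <vector>

// Since alpha is fixed, pay for pow() once per bin at start-up; zipf()
// can then replace pow((double) i, alpha) with pow_table[i].
std::vector<double> make_pow_table(int n, double alpha) {
    std::vector<double> pow_table(n + 1);   // index 0 unused, bins are 1-based
    for (int i = 1; i <= n; ++i)
        pow_table[i] = std::pow(static_cast<double>(i), alpha);
    return pow_table;
}
```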
Update: I see that your question, as now revised, may not be able to use this answer. From the present point, your question may resolve into a question in complex-variable theory. Such questions are often not easy, as you know. It may be that a sufficiently clever mathematician has discovered a relevant recurrence relation or some trick like the normal distribution's Box-Muller technique but, if so, I am not acquainted with it. Good luck. (It probably does not matter to you but, in case it does, the late N. N. Lebedev's excellent 1972 book Special Functions and Their Applications is available in English translation from the Russian in an inexpensive paperback edition. If you really, really wanted to crack this problem, you might read Lebedev next -- but, of course, that is a desperate measure, isn't it?)
The pre-calculation alone does not help that much. But the `sum_prob` values are cumulative, hence sorted in ascending order, so a binary search for the Zipf value reduces the cost of generating one Zipf-distributed number from O(n) to O(log n). That is a substantial efficiency improvement.
Here it is: just replace the `zipf()` function in `genzipf.c` with the following one.
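(The following is a sketch of that replacement; it uses `<random>` and `std::lower_bound` instead of genzipf.c's own `rand_val()`, so treat it as illustrative rather than a drop-in patch.)

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Sketch: precompute the cumulative probabilities once (as the question
// already does), then locate the drawn value with a binary search --
// O(log n) per number instead of an O(n) scan.
class zipf_bsearch {
public:
    zipf_bsearch(int n, double alpha) : sum_probs_(n + 1, 0.0) {
        double c = 0.0;                      // normalization constant
        for (int i = 1; i <= n; ++i) c += 1.0 / std::pow(double(i), alpha);
        c = 1.0 / c;
        for (int i = 1; i <= n; ++i)
            sum_probs_[i] = sum_probs_[i - 1] + c / std::pow(double(i), alpha);
    }

    template <class Rng>
    int operator()(Rng& rng) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        const double z = u(rng);
        // First index whose cumulative probability reaches z.
        auto it = std::lower_bound(sum_probs_.begin() + 1, sum_probs_.end(), z);
        if (it == sum_probs_.end()) --it;    // guard against rounding at 1.0
        return static_cast<int>(it - sum_probs_.begin());
    }

private:
    std::vector<double> sum_probs_;
};
```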