Mathematics behind modulo behavior

Published 2019-08-16 15:44

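Preamble

This question is not about (P)RNGs or the behavior of rand(). It is about taking the modulo of uniformly distributed values whose range is a power of two.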

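Introduction

I knew that one should not use modulo % to convert a value from one range to another, for example to get a value between 0 and 5 from the rand() function: there will be a bias. It is explained here https://bitbucket.org/haypo/hasard/src/ebf5870a1a54/doc/common_errors.rst?at=default and in this answer: Why do people say there is modulo bias when using a random number generator?

But today, after investigating some code that looked wrong, I made a tool to demonstrate the behavior of modulo: https://gitorious.org/modulo-test/modulo-test/trees/master , and found that the behavior was not clear enough to me.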

A die with only 3 bits

I checked with 6 values in the range 0..5. Only 3 bits are needed to encode those values.

$ ./modulo-test 10000 6 3
interations = 10000, range = 6, bits = 3 (0x00000007)
  [0..7] => [0..5]

theorical occurences    1666.67 probability 0.16666667

   [   0] occurences    2446    probability 0.24460000 ( +46.76%)
   [   1] occurences    2535    probability 0.25350000 ( +52.10%)
   [   2] occurences    1275    probability 0.12750000 ( -23.50%)
   [   3] occurences    1297    probability 0.12970000 ( -22.18%)
   [   4] occurences    1216    probability 0.12160000 ( -27.04%)
   [   5] occurences    1231    probability 0.12310000 ( -26.14%)

  minimum occurences    1216.00 probability 0.12160000 ( -27.04%)
  maximum occurences    2535.00 probability 0.25350000 ( +52.10%)
     mean occurences    1666.67 probability 0.16666667 (  +0.00%)
   stddev occurences     639.43 probability 0.06394256 (  38.37%)

With 3 bits of input, the results are indeed awful, but they behave as expected. See the answer https://stackoverflow.com/a/14614899/611560
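The experiment above can be approximated in a few lines. This is only a sketch, not the author's modulo-test tool: Python's `random.getrandbits` stands in for the masked `rand()` call, and the function name `modulo_experiment` and the seed are assumptions made for illustration.

```python
import random
from collections import Counter


def modulo_experiment(iterations, value_range, bits, seed=None):
    """Draw `iterations` uniform `bits`-bit integers and count each
    result of `value % value_range` (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(rng.getrandbits(bits) % value_range
                     for _ in range(iterations))
    # Return the counts in order 0 .. value_range-1.
    return [counts[v] for v in range(value_range)]


if __name__ == "__main__":
    iterations = 10000
    counts = modulo_experiment(iterations, 6, 3, seed=1)
    expected = iterations / 6
    for value, count in enumerate(counts):
        deviation = (count / expected - 1) * 100
        print(f"[{value}] occurrences {count:6d}  ({deviation:+.2f}%)")
```

With 3 bits, the counts for 0 and 1 should come out at roughly twice the counts for 2 to 5, as in the tool's output.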

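Increasing the number of input bits

What puzzled me is that increasing the number of input bits changes the results. Do not forget to increase the number of iterations (the number of samples) accordingly, otherwise the results may be wrong (see the statistical-error section below).

Let's try with 4 bits: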

$ ./modulo-test 20000 6 4
interations = 20000, range = 6, bits = 4 (0x0000000f)
  [0..15] => [0..5]

theorical occurences    3333.33 probability 0.16666667

   [   0] occurences    3728    probability 0.18640000 ( +11.84%)
   [   1] occurences    3763    probability 0.18815000 ( +12.89%)
   [   2] occurences    3675    probability 0.18375000 ( +10.25%)
   [   3] occurences    3721    probability 0.18605000 ( +11.63%)
   [   4] occurences    2573    probability 0.12865000 ( -22.81%)
   [   5] occurences    2540    probability 0.12700000 ( -23.80%)

  minimum occurences    2540.00 probability 0.12700000 ( -23.80%)
  maximum occurences    3763.00 probability 0.18815000 ( +12.89%)
     mean occurences    3333.33 probability 0.16666667 (  +0.00%)
   stddev occurences     602.48 probability 0.03012376 (  18.07%)

Let's try with 5 bits:

$ ./modulo-test 40000 6 5
interations = 40000, range = 6, bits = 5 (0x0000001f)
  [0..31] => [0..5]

theorical occurences    6666.67 probability 0.16666667

   [   0] occurences    7462    probability 0.18655000 ( +11.93%)
   [   1] occurences    7444    probability 0.18610000 ( +11.66%)
   [   2] occurences    6318    probability 0.15795000 (  -5.23%)
   [   3] occurences    6265    probability 0.15662500 (  -6.03%)
   [   4] occurences    6334    probability 0.15835000 (  -4.99%)
   [   5] occurences    6177    probability 0.15442500 (  -7.34%)

  minimum occurences    6177.00 probability 0.15442500 (  -7.34%)
  maximum occurences    7462.00 probability 0.18655000 ( +11.93%)
     mean occurences    6666.67 probability 0.16666667 (  +0.00%)
   stddev occurences     611.58 probability 0.01528949 (   9.17%)

Let's try with 6 bits:

$ ./modulo-test 80000 6 6
interations = 80000, range = 6, bits = 6 (0x0000003f)
  [0..63] => [0..5]

theorical occurences   13333.33 probability 0.16666667

   [   0] occurences   13741    probability 0.17176250 (  +3.06%)
   [   1] occurences   13610    probability 0.17012500 (  +2.08%)
   [   2] occurences   13890    probability 0.17362500 (  +4.18%)
   [   3] occurences   13702    probability 0.17127500 (  +2.77%)
   [   4] occurences   12492    probability 0.15615000 (  -6.31%)
   [   5] occurences   12565    probability 0.15706250 (  -5.76%)

  minimum occurences   12492.00 probability 0.15615000 (  -6.31%)
  maximum occurences   13890.00 probability 0.17362500 (  +4.18%)
     mean occurences   13333.33 probability 0.16666667 (  +0.00%)
   stddev occurences     630.35 probability 0.00787938 (   4.73%)

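Please explain why the results change when the number of input bits changes (with the sample count increased accordingly). What is the mathematical reasoning behind this?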

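Statistical error

In a previous version of this question, I showed a test with 32 bits of input and only one million iterations, i.e. 10^6 samples, and said I was surprised to get correct results. That was really wrong, and I am ashamed of it: one must have N times more samples to be confident of getting all 2^32 values of the generator. Here 10^6 is far too small compared to 2^32. Bonus for anyone able to explain this in mathematical/statistical language.

Here is the erroneous result: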

$ ./modulo-test 1000000 6 32
interations = 1000000, range = 6, bits = 32 (0xffffffff)
  [0..4294967295] => [0..5]

theorical occurences  166666.67 probability 0.16666667

   [   0] occurences  166881    probability 0.16688100 (  +0.13%)
   [   1] occurences  166881    probability 0.16688100 (  +0.13%)
   [   2] occurences  166487    probability 0.16648700 (  -0.11%)
   [   3] occurences  166484    probability 0.16648400 (  -0.11%)
   [   4] occurences  166750    probability 0.16675000 (  +0.05%)
   [   5] occurences  166517    probability 0.16651700 (  -0.09%)

  minimum occurences  166484.00 probability 0.16648400 (  -0.11%)
  maximum occurences  166881.00 probability 0.16688100 (  +0.13%)
     mean occurences  166666.67 probability 0.16666667 (  +0.00%)
   stddev occurences     193.32 probability 0.00019332 (   0.12%)
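A rough way to put the gap between 10^6 samples and 2^32 generator values into statistical terms is sketched below. It is a back-of-the-envelope estimate, not a rigorous test. The assumptions are mine: samples are independent, each outcome has probability close to 1/N, the count of one outcome is binomial, and the bias counts as "resolved" once it exceeds one standard deviation of sampling noise. All function names are hypothetical.

```python
import math


def relative_noise(samples, n):
    """Relative standard deviation of one outcome's count after
    `samples` independent draws, assuming probability 1/n."""
    p = 1 / n
    return math.sqrt((1 - p) / (samples * p))


def relative_bias(bits, n):
    """Relative gap between over- and under-represented outcomes
    when a uniform `bits`-bit integer is reduced modulo n."""
    q, r = divmod(1 << bits, n)
    return 0.0 if r == 0 else 1 / q


def samples_to_resolve(bits, n):
    """Rough sample count at which the noise drops to the bias
    (one-standard-deviation criterion, an assumption)."""
    bias = relative_bias(bits, n)
    if bias == 0.0:
        return math.inf  # no bias to detect
    p = 1 / n
    return (1 - p) / (p * bias ** 2)


if __name__ == "__main__":
    print(f"noise with 10^6 samples: {relative_noise(10**6, 6):.2e}")
    print(f"bias with 32 bits:       {relative_bias(32, 6):.2e}")
    print(f"samples to resolve:      {samples_to_resolve(32, 6):.2e}")
```

Under these assumptions, the noise with 10^6 samples is about 0.2%, while the 32-bit bias is on the order of 10^-9, so the bias is buried in the noise. Resolving it would take on the order of 10^18 samples.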

I still have to read and re-read the excellent article by Zed Shaw, "Programmers Need To Learn Statistics Or I Will Kill Them All".

Answer 1:

In essence, you are doing:

(rand() & 7) % 6

Let's assume that rand() is uniformly distributed on [0; RAND_MAX], and that RAND_MAX+1 is a power of two. Clearly, rand() & 7 can evaluate to 0, 1, ..., 7, and these outcomes are equiprobable.
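The equiprobability claim can be checked by brute force. The sketch below assumes a 15-bit generator (RAND_MAX = 32767, the minimum the C standard guarantees) purely for illustration; the helper names are hypothetical.

```python
from collections import Counter


def low_bits_mask(bits):
    """Mask that keeps the lowest `bits` bits, e.g. 3 -> 0x7."""
    return (1 << bits) - 1


def masked_counts(total_bits, kept_bits):
    """Enumerate every `total_bits`-bit value once and count how
    often each masked result occurs."""
    mask = low_bits_mask(kept_bits)
    return Counter(value & mask for value in range(1 << total_bits))


if __name__ == "__main__":
    print(hex(low_bits_mask(3)))
    # Assumed generator: 15 bits, i.e. values 0 .. 32767.
    print(sorted(masked_counts(15, 3).items()))
```

Because RAND_MAX+1 is a power of two, each of the 8 masked values is hit by exactly the same number of generator values.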

Now let's look at what happens when you take the result modulo 6:

  • 0 and 6 map to 0;
  • 1 and 7 map to 1;
  • 2 maps to 2;
  • 3 maps to 3;
  • 4 maps to 4;
  • 5 maps to 5.

This explains why you get twice as many zeros and ones as you get other numbers.
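The mapping listed above can be reproduced mechanically. This is a minimal sketch with a hypothetical helper name, `preimages`, that groups the 8 possible inputs by their remainder.

```python
from collections import defaultdict


def preimages(num_inputs, modulus):
    """Group the inputs 0 .. num_inputs-1 by `value % modulus`."""
    groups = defaultdict(list)
    for value in range(num_inputs):
        groups[value % modulus].append(value)
    return dict(groups)


if __name__ == "__main__":
    # rand() & 7 yields 8 equiprobable inputs; % 6 folds them onto 6 outputs.
    for output, inputs in preimages(8, 6).items():
        print(f"{output} <- {inputs}  probability {len(inputs)}/8")
```

Calling `preimages(16, 6)` instead shows the 4-bit case, where 0 to 3 each receive three inputs and 4 and 5 receive two.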

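The same thing happens in the second case. However, the weight of the "extra" numbers is much smaller, which makes their contribution indistinguishable from noise.

To summarize, if you have an integer uniformly distributed on [0; M-1] and you take it modulo N, the result will be biased towards zero unless M is a multiple of N.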
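The summary can be illustrated with exact fractions. The sketch below counts by brute force, so it is only practical for small M; the function names are assumptions made for illustration.

```python
from fractions import Fraction


def modulo_distribution(m, n):
    """Exact probability of each result of X % n, where X is
    uniform on 0 .. m-1 (brute-force count)."""
    counts = [0] * n
    for x in range(m):
        counts[x % n] += 1
    return [Fraction(c, m) for c in counts]


def is_unbiased(m, n):
    """True when every result of X % n is equally likely."""
    return len(set(modulo_distribution(m, n))) == 1


if __name__ == "__main__":
    for m in (8, 12, 16, 18):
        probs = ", ".join(str(p) for p in modulo_distribution(m, 6))
        print(f"M={m:2d}: {probs}  unbiased={is_unbiased(m, 6)}")
```

For M = 16 and N = 6 this gives 3/16 (0.1875) for 0 to 3 and 1/8 (0.125) for 4 and 5, close to the 4-bit measurements in the question. M = 12 and M = 18 come out unbiased because they are multiples of 6.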



Answer 2:

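rand() (or some other PRNG) produces values in the interval [0 .. RAND_MAX]. You want to map these to the interval [0 .. N-1] using the remainder operator. Write

(RAND_MAX+1) = q*N + r

with 0 <= r < N.

Then, for each value in the interval [0 .. N-1], there are

  • q+1 values of rand() that are mapped to that value, if the value is smaller than r
  • q values of rand() that are mapped to it, if the value is >= r.

Now, if q is small, the relative difference between q and q+1 is large; but if q is large (2^32 / 6, for example), the difference cannot easily be measured.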
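The decomposition above can be evaluated for the bit widths used in the question. This is a sketch under the assumption that the generator yields exactly `bits` uniform bits, so RAND_MAX+1 = 2^bits; the helper names are hypothetical.

```python
from fractions import Fraction


def quotient_remainder(bits, n):
    """Return (q, r) such that 2**bits == q*n + r with 0 <= r < n."""
    return divmod(1 << bits, n)


def relative_gap(bits, n):
    """Relative difference between q+1 and q, i.e. 1/q."""
    q, _ = quotient_remainder(bits, n)
    return Fraction(1, q)


if __name__ == "__main__":
    for bits in (3, 4, 5, 6, 32):
        q, r = quotient_remainder(bits, 6)
        gap = float(relative_gap(bits, 6))
        print(f"bits={bits:2d}  q={q:<10d} r={r}  "
              f"values < {r} occur q+1 times, gap 1/q = {gap:.3e}")
```

For N = 6 the remainder r alternates between 2 and 4 as bits are added, which is why 0 and 1 are over-represented with 3 and 5 bits while 0 to 3 are over-represented with 4 and 6 bits. Meanwhile q grows and the relative gap 1/q shrinks.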



Source: mathematics behind modulo behavior