Selection without any weights (equal probabilities) is beautifully described here. I was wondering if there is a way to convert this approach to a weighted one. I am also interested in other approaches as well.

Update: Sampling without replacement
I know this is a very old question, but I think there's a neat trick to do this in O(n) time if you apply a little math!

The exponential distribution has two very useful properties.

1. Given n samples from different exponential distributions with different rate parameters, the probability that a given sample is the minimum is equal to its rate parameter divided by the sum of all rate parameters.

2. It is "memoryless". So if you already know the minimum, then the probability that any of the remaining elements is the 2nd-to-min is the same as the probability that, if the true min were removed (and never generated), that element would have been the new min. This seems obvious, but I think because of some conditional probability issues it might not be true of other distributions.

Using fact 1, we know that choosing a single element can be done by generating one exponential-distribution sample per element, with rate parameter equal to its weight, and then choosing the one with the minimum value.

Using fact 2, we know that we don't have to re-generate the exponential samples. Instead, just generate one for each element and take the k elements with the lowest samples.

Finding the lowest k can be done in O(n): use the Quickselect algorithm to find the k-th smallest element, then simply take another pass through all elements and output everything below the k-th.

A useful note: if you don't have immediate access to a library that generates exponential-distribution samples, it can be done easily with: -ln(rand())/weight
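Here is a minimal Python sketch of this trick (mine, not the answerer's; heapq.nsmallest stands in for the Quickselect pass, which makes it O(n log k) instead of O(n)):

import heapq
import math
import random

def expo_sample_without_replacement(items, k):
    # items: iterable of (weight, value) pairs with weight > 0.
    # Draw one exponential sample per item with rate = weight;
    # 1 - random.random() lies in (0, 1], so log() never sees zero.
    keyed = ((-math.log(1.0 - random.random()) / w, v) for w, v in items)
    # Keep the k items whose exponential samples are smallest.
    return [v for _, v in heapq.nsmallest(k, keyed)]

print(expo_sample_without_replacement([(1, 'a'), (2, 'b'), (3, 'c')], 2))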
If the sampling is with replacement, you can use this algorithm (implemented here in Python):
import random

items = [(10, "low"),
         (100, "mid"),
         (890, "large")]

def weighted_sample(items, n):
    total = float(sum(w for w, v in items))
    i = 0
    w, v = items[0]
    while n:
        x = total * (1 - random.random() ** (1.0 / n))
        total -= x
        while x > w:
            x -= w
            i += 1
            w, v = items[i]
        w -= x
        yield v
        n -= 1
This is O(n + m) where m is the number of items.

Why does this work? It's based on the following algorithm:
def n_random_numbers_decreasing(v, n):
    """Like reversed(sorted(v * random() for i in range(n))),
    but faster because we avoid sorting."""
    while n:
        v *= random.random() ** (1.0 / n)
        yield v
        n -= 1
The function weighted_sample is just this algorithm fused with a walk of the items list, to pick out the items selected by those random numbers.

This in turn works because the probability that n random numbers 0..v will all happen to be less than z is P = (z/v)^n. Solve for z, and you get z = v * P^(1/n). Substituting a random number for P picks the largest number with the correct distribution; and we can just repeat the process to select all the other numbers.
If the sampling is without replacement, you can put all the items into a binary heap, where each node caches the total of the weights of all items in its subheap. Building the heap is O(m). Selecting a random item from the heap, respecting the weights, is O(log m). Removing the item and updating the cached totals is also O(log m). So you can pick n items in O(m + n log m) time.

(Note: "weight" here means that every time an element is selected, the remaining possibilities are chosen with probability proportional to their weights. It does not mean that elements appear in the output with a likelihood proportional to their weights.)

Here's an implementation, plentifully commented:
import random

class Node:
    # Each node in the heap has a weight, value, and total weight.
    # The total weight, self.tw, is self.w plus the weight of any children.
    __slots__ = ['w', 'v', 'tw']
    def __init__(self, w, v, tw):
        self.w, self.v, self.tw = w, v, tw

def rws_heap(items):
    # h is the heap. It's like a binary tree that lives in an array.
    # It has a Node for each pair in `items`. h[1] is the root. Each
    # other Node h[i] has a parent at h[i>>1]. Each node has up to 2
    # children, h[i<<1] and h[(i<<1)+1]. To get this nice simple
    # arithmetic, we have to leave h[0] vacant.
    h = [None]                          # leave h[0] vacant
    for w, v in items:
        h.append(Node(w, v, w))
    for i in range(len(h) - 1, 1, -1):  # total up the tws
        h[i>>1].tw += h[i].tw           # add h[i]'s total to its parent
    return h

def rws_heap_pop(h):
    gas = h[1].tw * random.random()     # start with a random amount of gas

    i = 1                     # start driving at the root
    while gas >= h[i].w:      # while we have enough gas to get past node i:
        gas -= h[i].w         #   drive past node i
        i <<= 1               #   move to first child
        if gas >= h[i].tw:    #   if we have enough gas:
            gas -= h[i].tw    #     drive past first child and descendants
            i += 1            #     move to second child
    w = h[i].w                # out of gas! h[i] is the selected node.
    v = h[i].v

    h[i].w = 0                # make sure this node isn't chosen again
    while i:                  # fix up total weights
        h[i].tw -= w
        i >>= 1
    return v

def random_weighted_sample_no_replacement(items, n):
    heap = rws_heap(items)              # just make a heap...
    for i in range(n):
        yield rws_heap_pop(heap)        # and pop n items off it.
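For instance (my usage note, not part of the answer), reusing the items list from the with-replacement example above:

items = [(10, "low"), (100, "mid"), (890, "large")]
print(list(random_weighted_sample_no_replacement(items, 2)))
# e.g. ['large', 'mid']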
If the sampling is with replacement, use the roulette-wheel selection technique (often used in genetic algorithms): compute the cumulative weights, pick a random number in [0,1]*totalWeight, find the interval this number falls into, select the corresponding element, and repeat k times.

If the sampling is without replacement, you can adapt the above technique by removing the selected element from the list after each iteration, then re-normalizing the weights so that they sum to 1 (a valid probability distribution function).
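A minimal Python sketch of the with-replacement case (my own illustration, not code from this answer), using a cumulative-weight table and a binary search per spin:

import bisect
import random

def roulette_wheel(items, k):
    # items: list of (weight, value) pairs; samples k values with replacement.
    weights, values = zip(*items)
    cumulative, total = [], 0.0
    for w in weights:
        total += w
        cumulative.append(total)
    picks = []
    for _ in range(k):
        r = random.uniform(0.0, total)
        # bisect_left also handles the rare r == total endpoint safely
        picks.append(values[bisect.bisect_left(cumulative, r)])
    return picks

For the without-replacement variant, delete the chosen pair from items and rebuild the cumulative table before the next spin.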
I've done this in Ruby:
https://github.com/fl00r/pickup
require 'pickup'
pond = {
  "selmon"   => 1,
  "carp"     => 4,
  "crucian"  => 3,
  "herring"  => 6,
  "sturgeon" => 8,
  "gudgeon"  => 10,
  "minnow"   => 20
}
pickup = Pickup.new(pond, uniq: true)
pickup.pick(3)
#=> [ "gudgeon", "herring", "minnow" ]
pickup.pick
#=> "herring"
pickup.pick
#=> "gudgeon"
pickup.pick
#=> "sturgeon"
If you want to generate large arrays of random integers with replacement, you can use piecewise linear interpolation. For example, using NumPy/SciPy:
import numpy
import scipy.interpolate

def weighted_randint(weights, size=None):
    """Given an n-element vector of weights, randomly sample
    integers up to n with probabilities proportional to weights"""
    n = weights.size
    # normalize so that the weights sum to unity
    weights = weights / numpy.linalg.norm(weights, 1)
    # cumulative sum of weights
    cumulative_weights = weights.cumsum()
    # piecewise-linear interpolating function whose domain is
    # the unit interval and whose range is the integers up to n
    f = scipy.interpolate.interp1d(
            numpy.hstack((0.0, cumulative_weights)),
            numpy.arange(n + 1), kind='linear')
    return f(numpy.random.random(size=size)).astype(int)
This is not efficient if you want to sample without replacement, though.
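A quick sanity check of the function above (my addition):

import numpy

weights = numpy.array([1.0, 2.0, 7.0])
draws = weighted_randint(weights, size=100000)
print(numpy.bincount(draws) / draws.size)  # should approach [0.1, 0.2, 0.7]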
Here's a Go implementation from geodns:
package foo

import (
    "log"
    "math/rand"
)

type server struct {
    Weight int
    data   interface{}
}

func foo(servers []server) []server {
    // servers list is already sorted by the Weight attribute

    // number of items to pick
    max := 4

    result := make([]server, max)

    sum := 0
    for _, r := range servers {
        sum += r.Weight
    }

    for si := 0; si < max; si++ {
        n := rand.Intn(sum + 1)
        s := 0

        for i := range servers {
            s += int(servers[i].Weight)
            if s >= n {
                log.Println("Picked record", i, servers[i])
                sum -= servers[i].Weight
                result[si] = servers[i]

                // remove the server from the list
                servers = append(servers[:i], servers[i+1:]...)
                break
            }
        }
    }

    return result
}
If you want to pick x elements from a weighted set without replacement, such that elements are chosen with probability proportional to their weights:
import random

def weighted_choose_subset(weighted_set, count):
    """Return a random sample of count elements from a weighted set.

    weighted_set should be a sequence of tuples of the form
    (item, weight), for example:  [('a', 1), ('b', 2), ('c', 3)]

    Each element from weighted_set shows up at most once in the
    result, and the relative likelihood of two particular elements
    showing up is equal to the ratio of their weights.

    This works as follows:

    1.) Line up the items along the number line from [0, the sum
    of all weights) such that each item occupies a segment of
    length equal to its weight.

    2.) Randomly pick a number "start" in the range [0, total
    weight / count).

    3.) Find all the points "start + n/count" (for all integers n
    such that the point is within our segments) and yield the set
    containing the items marked by those points.

    Note that this implementation may not return each possible
    subset.  For example, with the input ([('a': 1), ('b': 1),
    ('c': 1), ('d': 1)], 2), it may only produce the sets ['a',
    'c'] and ['b', 'd'], but it will do so such that the weights
    are respected.

    This implementation only works for nonnegative integral
    weights.  The highest weight in the input set must be less
    than the total weight divided by the count; otherwise it would
    be impossible to respect the weights while never returning
    that element more than once per invocation.
    """
    if count == 0:
        return []

    total_weight = 0
    max_weight = 0
    borders = []
    for item, weight in weighted_set:
        if weight < 0:
            raise RuntimeError("All weights must be positive integers")
        # Scale up weights so dividing total_weight / count doesn't truncate:
        weight *= count
        total_weight += weight
        borders.append(total_weight)
        max_weight = max(max_weight, weight)

    step = int(total_weight / count)

    if max_weight > step:
        raise RuntimeError(
            "Each weight must be less than total weight / count")

    next_stop = random.randint(0, step - 1)

    results = []
    current = 0
    for i in range(count):
        while borders[current] <= next_stop:
            current += 1
        results.append(weighted_set[current][0])
        next_stop += step

    return results
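For example (my usage sketch), using the weights from the docstring:

print(weighted_choose_subset([('a', 1), ('b', 2), ('c', 3)], 2))
# e.g. ['b', 'c']; across many calls, 'c' (weight 3) shows up
# three times as often as 'a' (weight 1)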
In the question you linked to, Kyle's solution would work with a trivial generalization. Scan the list and sum the total weight. Then the probability of choosing an element should be: 1 - (1 - (#needed / (weight left)))^(weight of element n). After visiting a node, subtract its weight from the total. Also, if you need n and have n left, you have to stop explicitly; see the sketch below.

You can check that with everything having weight 1, this simplifies to Kyle's solution.
Edit: (had to rethink what "twice as likely" would mean)
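As a sketch of that scan (my reading of the formula above, treating an element of integer weight w as w unit slots; an interpretation, not code from the answer):

import random

def scan_select(items, needed):
    # items: list of (weight, value) pairs with nonnegative integer weights
    weight_left = sum(w for w, _ in items)
    chosen = []
    for w, v in items:
        if needed == 0:  # stop explicitly once we have enough
            break
        # keep this element with probability 1 - (1 - needed/weight_left)**w
        if random.random() < 1 - (1 - needed / weight_left) ** w:
            chosen.append(v)
            needed -= 1
        weight_left -= w
    return chosen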
This one does exactly that, in O(n) and with no excess memory use. I believe this is a clever and efficient solution that's easy to port to any language. The first two lines just populate sample data in Drupal.
function getNrandomGuysWithWeight($numitems){
  $q = db_query('SELECT id, weight FROM theTableWithTheData');
  $q = $q->fetchAll();

  $accum = 0;
  foreach($q as $r){
    $accum += $r->weight;
    $r->weight = $accum;
  }

  $out = array();

  while(count($out) < $numitems && count($q)){
    $n = rand(0, $accum);
    $lessaccum = NULL;
    $prevaccum = 0;
    $idxrm = 0;
    foreach($q as $i => $r){
      if(($lessaccum == NULL) && ($n <= $r->weight)){
        $out[] = $r->id;
        $lessaccum = $r->weight - $prevaccum;
        $accum -= $lessaccum;
        $idxrm = $i;
      }else if($lessaccum){
        $r->weight -= $lessaccum;
      }
      $prevaccum = $r->weight;
    }
    unset($q[$idxrm]);
  }
  return $out;
}
Here I share a simple solution for picking one item; you can easily expand it to k items (Java style):
// Assumes the weights returned by getValue() sum to 1;
// otherwise, scale the random draw by the total weight first.
double random = Math.random();
double sum = 0;
Item selected = null;        // Item: the element type of items[]
for (int i = 0; i < items.length; i++) {
    Item val = items[i];
    sum += val.getValue();
    if (sum > random) {
        selected = val;
        break;
    }
}
I've implemented an algorithm similar to Jason Orendorff's idea above in Rust, here. My version additionally supports bulk operations: inserting and removing (when you want to remove a bunch of items given by their ids, not through the weighted selection path) from the data structure in O(m + log n) time, where m is the number of items to remove and n the number of items stored.
Sampling without replacement with recursion: an elegant and very short solution in C#

// How many ways can we choose 4 out of a group of 60 students,
// so that every time we choose a different 4?

using System;

class Program
{
    static void Main(string[] args)
    {
        int group = 60;
        int studentsToChoose = 4;

        Console.WriteLine(FindNumberOfStudents(studentsToChoose, group));
    }

    private static int FindNumberOfStudents(int studentsToChoose, int group)
    {
        if (studentsToChoose == group || studentsToChoose == 0)
            return 1;

        return FindNumberOfStudents(studentsToChoose, group - 1)
             + FindNumberOfStudents(studentsToChoose - 1, group - 1);
    }
}
I just spent a few hours trying to get behind the algorithms underlying sampling without replacement out there, and this topic is more complex than I initially thought. That's exciting! For the benefit of future readers (have a good day!) I document my insights here, including a ready-to-use function which respects the given inclusion probabilities further below. A nice and quick mathematical overview of the various methods can be found here: Tillé: Algorithms of sampling with equal or unequal probabilities. For example, Jason's method can be found on page 46. The caveat with his method is that the weights are not proportional to the inclusion probabilities, as the document also notes. Actually, the i-th inclusion probabilities can be recursively computed as follows:
def inclusion_probability(i, weights, k):
    """
    Computes the inclusion probability of the i-th element
    in a randomly sampled k-tuple using Jason's algorithm
    (see https://stackoverflow.com/a/2149533/7729124)
    """
    if k <= 0: return 0

    cum_p = 0
    for j, weight in enumerate(weights):
        # compute the probability of j being selected considering the weights
        p = weight / sum(weights)

        if i == j:
            # if this is the target element, we don't have to go deeper,
            # since we know that i is included
            cum_p += p
        else:
            # if this is not the target element, then we compute the conditional
            # inclusion probability of i under the constraint that j is included
            cond_i = i if i < j else i - 1
            cond_weights = weights[:j] + weights[j+1:]
            cond_p = inclusion_probability(cond_i, cond_weights, k - 1)
            cum_p += p * cond_p

    return cum_p
We can check the validity of the function above by comparing
In : for i in range(3): print(i, inclusion_probability(i, [1,2,3], 2))
0 0.41666666666666663
1 0.7333333333333333
2 0.85
to
In : import collections, itertools
In : sample_tester = lambda f: collections.Counter(itertools.chain(*(f() for _ in range(10000))))
In : sample_tester(lambda: random_weighted_sample_no_replacement([(1,'a'),(2,'b'),(3,'c')],2))
Out: Counter({'a': 4198, 'b': 7268, 'c': 8534})
One way, also suggested in the document above, to specify the inclusion probabilities is to compute the weights from them. The whole complexity of the question at hand stems from the fact that one cannot do that directly, since one basically has to invert the recursion formula; symbolically, I claim this is impossible. Numerically it can be done using all kinds of methods, e.g. Newton's method. However, the complexity of inverting the Jacobian with plain Python quickly becomes unbearable; I really recommend looking into numpy.random.choice in that case.

Luckily, there is a method using plain Python which may or may not be sufficiently performant for your purposes; it works great if there aren't that many different weights. You can find the algorithm on pages 75 and 76. It works by splitting up the sampling process into parts with the same inclusion probabilities, i.e. we can use random.sample again! I am not going to explain the principle here, since it is nicely presented on page 69. Here is the code, with hopefully a sufficient amount of comments:
def sample_no_replacement_exact(items, k, best_effort=False, random_=None, ε=1e-9):
    """
    Returns a random sample of k elements from items, where items is a list of
    tuples (weight, element). The inclusion probability of an element in the
    final sample is given by
        k * weight / sum(weights).

    Note that the function raises if an inclusion probability cannot be
    satisfied, e.g. the following call is obviously illegal:
        sample_no_replacement_exact([(1,'a'),(2,'b')],2)

    Since selecting two elements means selecting both all the time,
    'b' cannot be selected twice as often as 'a'. In general it can be hard to
    spot if the weights are illegal and the function does *not* always raise
    an exception in that case. To remedy the situation you can pass
    best_effort=True which redistributes the inclusion probability mass
    if necessary. Note that the inclusion probabilities will change
    if deemed necessary.

    The algorithm is based on the splitting procedure on page 75/76 in:
    http://www.eustat.eus/productosServicios/52.1_Unequal_prob_sampling.pdf
    Additional information can be found here:
    https://stackoverflow.com/questions/2140787/

    :param items: list of tuples of type weight,element
    :param k: length of resulting sample
    :param best_effort: fix inclusion probabilities if necessary,
                        (optional, defaults to False)
    :param random_: random module to use (optional, defaults to the
                    standard random module)
    :param ε: fuzziness parameter when testing for zero in the context
              of floating point arithmetic (optional, defaults to 1e-9)
    :return: random sample set of size k
    :exception: throws ValueError in case of bad parameters,
                throws AssertionError in case of algorithmic impossibilities
    """
    # random_ defaults to the random submodule
    if not random_:
        random_ = random

    # special case empty return set
    if k <= 0:
        return set()

    if k > len(items):
        raise ValueError("resulting tuple length exceeds number of elements (k > n)")

    # sort items by weight
    items = sorted(items, key=lambda item: item[0])

    # extract the weights and elements
    weights, elements = list(zip(*items))

    # compute the inclusion probabilities (short: π) of the elements
    scaling_factor = k / sum(weights)
    π = [scaling_factor * weight for weight in weights]

    # in case of best_effort: if an inclusion probability exceeds 1,
    # try to rebalance the probabilities such that:
    # a) no probability exceeds 1,
    # b) the probabilities still sum to k, and
    # c) the probability masses flow from top to bottom:
    #    [0.2, 0.3, 1.5] -> [0.2, 0.8, 1]
    # (remember that π is sorted)
    if best_effort and π[-1] > 1 + ε:
        # probability mass we still have to distribute
        debt = 0.
        for i in reversed(range(len(π))):
            if π[i] > 1.:
                # an 'offender', take away excess
                debt += π[i] - 1.
                π[i] = 1.
            else:
                # case π[i] < 1, i.e. a 'safe' element
                # the maximum we can transfer from debt to π[i] and still not
                # exceed 1 is computed by the minimum of:
                # a) 1 - π[i], and
                # b) debt
                max_transfer = min(debt, 1. - π[i])
                debt -= max_transfer
                π[i] += max_transfer
        assert debt < ε, "best effort rebalancing failed (impossible)"

    # make sure we are talking about probabilities
    if any(not (0 - ε <= π_i <= 1 + ε) for π_i in π):
        raise ValueError("inclusion probabilities not satisfiable: {}" \
                         .format(list(zip(π, elements))))

    # special case equal probabilities
    # (up to fuzziness parameter, remember that π is sorted)
    if π[-1] < π[0] + ε:
        return set(random_.sample(elements, k))

    # compute the two possible lambda values, see formula 7 on page 75
    # (remember that π is sorted)
    λ1 = π[0] * len(π) / k
    λ2 = (1 - π[-1]) * len(π) / (len(π) - k)
    λ = min(λ1, λ2)

    # there are two cases now, see also page 69
    # CASE 1
    # with probability λ we are in the equal probability case
    # where all elements have the same inclusion probability
    if random_.random() < λ:
        return set(random_.sample(elements, k))

    # CASE 2:
    # with probability 1-λ we are in the case of a new sample-without-
    # replacement problem which is strictly simpler,
    # it has the following new probabilities (see page 75, π^{(2)}):
    new_π = [
        (π_i - λ * k / len(π))
        /
        (1 - λ)
        for π_i in π
    ]
    new_items = list(zip(new_π, elements))

    # the first few probabilities might be 0, remove them
    # NOTE: we make sure that floating point issues do not arise
    #       by using the fuzziness parameter
    while new_items and new_items[0][0] < ε:
        new_items = new_items[1:]

    # the last few probabilities might be 1, remove them and mark them as selected
    # NOTE: we make sure that floating point issues do not arise
    #       by using the fuzziness parameter
    selected_elements = set()
    while new_items and new_items[-1][0] > 1 - ε:
        selected_elements.add(new_items[-1][1])
        new_items = new_items[:-1]

    # the algorithm reduces the length of the sample problem,
    # it is guaranteed that:
    # if λ = λ1: the first item has probability 0
    # if λ = λ2: the last item has probability 1
    assert len(new_items) < len(items), "problem was not simplified (impossible)"

    # recursive call with the simpler sample problem
    # NOTE: we have to make sure that the selected elements are included
    return sample_no_replacement_exact(
        new_items,
        k - len(selected_elements),
        best_effort=best_effort,
        random_=random_,
        ε=ε
    ) | selected_elements
Example:
In : sample_no_replacement_exact([(1,'a'),(2,'b'),(3,'c')],2)
Out: {'b', 'c'}
In : import collections, itertools
In : sample_tester = lambda f: collections.Counter(itertools.chain(*(f() for _ in range(10000))))
In : sample_tester(lambda: sample_no_replacement_exact([(1,'a'),(2,'b'),(3,'c'),(4,'d')],2))
Out: Counter({'a': 2048, 'b': 4051, 'c': 5979, 'd': 7922})
The weights sum to 10, hence the inclusion probabilities compute to: a → 20%, b → 40%, c → 60%, d → 80% (sum: 200% = k). It works!
Just one word of caution for productive use of this function: it can be very hard to spot illegal inputs for the weights. An obvious illegal example is:

In: sample_no_replacement_exact([(1,'a'),(2,'b')],2)
ValueError: inclusion probabilities not satisfiable: [(0.6666666666666666, 'a'), (1.3333333333333333, 'b')]

b cannot appear twice as often as a, since both must always be selected. There are more subtle examples. To avoid an exception in production, just use best_effort=True, which rebalances the inclusion-probability mass such that we always have a valid distribution. Obviously, this might change the inclusion probabilities.
I used an associative map of (weight, object). For example:

{
(10, "low"),
(100, "mid"),
(10000, "large")
}

total = 10110

Pick a random number between 0 and 'total', then iterate over the keys until the number falls within the corresponding range.
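A minimal Python sketch of that lookup (my illustration; a single draw, with replacement):

import random

def pick(weighted_items):
    # weighted_items: iterable of (weight, value) pairs with weights > 0
    total = sum(w for w, _ in weighted_items)
    r = random.uniform(0, total)
    accumulated = 0
    for w, v in weighted_items:
        accumulated += w
        if r <= accumulated:
            return v

print(pick([(10, "low"), (100, "mid"), (10000, "large")]))  # usually "large"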