Weighted percentile using numpy

Posted 2019-01-17 04:01

Is there a way to use the numpy.percentile function to compute a weighted percentile? Or is anyone aware of an alternative Python function that computes a weighted percentile?

Thanks!

9 answers
SAY GOODBYE
#2 · 2019-01-17 04:39

I don't know what a weighted percentile means, but judging from @Joan Smith's answer, it seems that you just need to repeat every element in ar. You can use numpy.repeat() for that:

import numpy as np
np.repeat([1,2,3], [4,5,6])

The result is:

array([1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3])
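Once the values are repeated, the ordinary np.percentile can be applied to the expanded array. A minimal sketch, assuming the weights are non-negative integer counts (the sample values and variable names are illustrative, not taken from the question):

```python
import numpy as np

values = [1, 2, 3]
counts = [4, 5, 6]  # assumed: integer weights, used as repeat counts

expanded = np.repeat(values, counts)   # 15 elements in total
median = np.percentile(expanded, 50)   # middle element of the expanded array
print(median)  # 2.0
```

This only works for integer weights, and the expanded array grows with the sum of the weights.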
看我几分像从前
#3 · 2019-01-17 04:43

I use this function for my needs:

import numpy

def quantile_at_values(values, population, weights=None):
    values = numpy.atleast_1d(values).astype(float)
    population = numpy.atleast_1d(population).astype(float)
    # if no weights are given, use equal weights
    if weights is None:
        weights = numpy.ones(population.shape).astype(float)
        normal = float(len(weights))
    # else, check weights
    else:
        weights = numpy.atleast_1d(weights).astype(float)
        assert len(weights) == len(population)
        assert (weights >= 0).all()
        normal = numpy.sum(weights)
        assert normal > 0.
    # fraction of the total weight at or below each requested value
    quantiles = numpy.array([numpy.sum(weights[population <= value]) for value in values]) / normal
    assert (quantiles >= 0).all() and (quantiles <= 1).all()
    return quantiles
  • It is vectorized as far as I could go.
  • It has a lot of sanity checks.
  • It works with floats as weights.
  • It can work without weights (→ equal weights).
  • It can compute multiple quantiles at once.

Multiply results by 100 if you want percentiles instead of quantiles.
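Note the direction of this function: it maps values to quantiles, returning the fraction of the total weight at or below each value. The list comprehension could probably be replaced by broadcasting. The sketch below is one way to do that; `weighted_ecdf` is a hypothetical name and the sample data is made up:

```python
import numpy as np

def weighted_ecdf(values, population, weights):
    """Fraction of total weight at or below each value (hypothetical helper)."""
    values = np.atleast_1d(values).astype(float)
    population = np.asarray(population, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Compare every requested value against every population entry at once
    mask = population[None, :] <= values[:, None]
    return (mask * weights).sum(axis=1) / weights.sum()

population = [1.0, 2.0, 3.0, 4.0]
weights = [1.0, 1.0, 2.0, 4.0]
ranks = weighted_ecdf([2.0, 3.0], population, weights)
print(ranks * 100)  # percentile ranks of 2.0 and 3.0: 25 and 50
```

The broadcasted mask uses memory proportional to len(values) times len(population), so the loop may still be preferable for very large inputs.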

再贱就再见
#4 · 2019-01-17 04:45

Apologies for the additional (unoriginal) answer (not enough rep to comment on @nayyarv's). His solution worked for me (i.e. it replicates the default behavior of np.percentile), but I think you can eliminate the for loop with clues from how the original np.percentile is written.

import numpy as np

def weighted_percentile(a, q=np.array([75, 25]), w=None):
    """
    Calculates percentiles associated with a (possibly weighted) array

    Parameters
    ----------
    a : array-like
        The input array from which to calculate percents
    q : array-like
        The percentiles to calculate (0.0 - 100.0)
    w : array-like, optional
        The weights to assign to values of a.  Equal weighting if None
        is specified

    Returns
    -------
    values : np.array
        The values associated with the specified percentiles.  
    """
    # Standardize and sort based on values in a
    q = np.array(q) / 100.0
    if w is None:
        w = np.ones(a.size)
    idx = np.argsort(a)
    a_sort = a[idx]
    w_sort = w[idx]

    # Get the cumulative sum of weights
    ecdf = np.cumsum(w_sort)

    # Find the percentile index positions associated with the percentiles
    p = q * (w.sum() - 1)

    # Find the bounding indices (both low and high)
    idx_low = np.searchsorted(ecdf, p, side='right')
    idx_high = np.searchsorted(ecdf, p + 1, side='right')
    idx_high[idx_high > ecdf.size - 1] = ecdf.size - 1

    # Calculate the weights 
    weights_high = p - np.floor(p)
    weights_low = 1.0 - weights_high

    # Extract the low/high indexes and multiply by the corresponding weights
    x1 = np.take(a_sort, idx_low) * weights_low
    x2 = np.take(a_sort, idx_high) * weights_high

    # Return the average
    return np.add(x1, x2)

# Sample data
# np.float and np.int were removed from recent NumPy; use the builtins
a = np.array([1.0, 2.0, 9.0, 3.2, 4.0], dtype=float)
w = np.array([2.0, 1.0, 3.0, 4.0, 1.0], dtype=float)

# Make an unweighted "copy" of a for testing
a2 = np.repeat(a, w.astype(int))

# Tests with different percentiles chosen
q1 = np.linspace(0.0, 100.0, 11)
q2 = np.linspace(5.0, 95.0, 10)
q3 = np.linspace(4.0, 94.0, 10)
for q in (q1, q2, q3):
    # Compare with a tolerance rather than exact float equality
    assert np.allclose(weighted_percentile(a, q, w), np.percentile(a2, q))
\"骚年 ilove
#5 · 2019-01-17 04:48

A quick solution, by first sorting and then interpolating:

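import numpy as np

def weighted_percentile(data, percents, weights=None):
    ''' percents in units of 1%
    weights specifies the frequency (count) of data.
    '''
    if weights is None:
        return np.percentile(data, percents)
    # accept lists as well as arrays
    data = np.asarray(data)
    weights = np.asarray(weights)
    ind = np.argsort(data)            # sort values, carrying weights along
    d = data[ind]
    w = weights[ind]
    p = 100.0 * w.cumsum() / w.sum()  # cumulative weight in percent
    y = np.interp(percents, p, d)
    return y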
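One thing to be aware of with the cumulative-sum approach: with equal weights it does not reproduce np.percentile, because each sorted value is placed at the upper end of its weight interval. The sketch below shows this on made-up data, and also tries a midpoint placement, which is an assumed alternative convention and not part of the answer above:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0])
weights = np.array([1.0, 1.0, 1.0])  # equal weights, illustrative data

order = np.argsort(data)
d, w = data[order], weights[order]

# Placement used above: each value sits at the end of its weight interval
p_end = 100.0 * w.cumsum() / w.sum()
median_end = np.interp(50, p_end, d)

# Assumed alternative: each value sits at the middle of its weight interval
p_mid = 100.0 * (w.cumsum() - 0.5 * w) / w.sum()
median_mid = np.interp(50, p_mid, d)

median_plain = np.percentile(data, 50)
print(median_end, median_mid, median_plain)
```

Here the end-of-interval placement gives a median of about 1.5, while the midpoint placement and np.percentile both give 2.0. Which convention is appropriate depends on how the weights are meant to be interpreted.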
劳资没心,怎么记你
#6 · 2019-01-17 04:49

Unfortunately, numpy doesn't have built-in weighted functions for everything, but you can always put something together.

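import numpy as np

def weight_array(ar, weights):
    # Repeat each value weights[i] times; weights must be non-negative integers
    weighted = []
    for value, count in zip(ar, weights):
        for _ in range(count):
            weighted.append(value)
    return weighted


# ar and weights stand for your own data and integer weights
np.percentile(weight_array(ar, weights), 25)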
我想做一个坏孩纸
#7 · 2019-01-17 04:49

As mentioned in the comments, simply repeating values is impossible for float weights and impractical for very large datasets. There is a library that does weighted percentiles, and it worked for me: http://kochanski.org/gpk/code/speechresearch/gmisclib/gmisclib.weighted_percentile-module.html
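For readers who would rather not pull in a library, float weights can be handled without repeating anything by working on the cumulative weights directly. The following is only a sketch under one assumed definition (the smallest value whose cumulative weight share reaches the requested percent); it is not the linked library's implementation, and `weighted_percentile_step` is a hypothetical name:

```python
import numpy as np

def weighted_percentile_step(values, weights, q):
    """Smallest value whose cumulative weight share reaches q percent."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w) / w.sum()                     # cumulative share in [0, 1]
    target = np.asarray(q, dtype=float) / 100.0
    idx = np.searchsorted(cum, target, side="left")  # first index with cum >= target
    return v[np.minimum(idx, len(v) - 1)]            # guard against rounding at 100%

result = weighted_percentile_step([10.0, 20.0, 30.0], [0.5, 0.25, 0.25], [50, 90])
print(result)  # [10. 30.]
```

The memory cost is linear in the number of values, regardless of how large the weights are.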
