How to calculate the Kolmogorov-Smirnov statistic

2019-04-11 13:09发布

问题:

Let's say that we have two samples data1 and data2 with their respective weights weight1 and weight2 and that we want to calculate the Kolmogorov-Smirnov statistic between the two weighted samples.

The way we do that in python follows:

def ks_w(data1,data2,wei1,wei2):
    ix1=np.argsort(data1)
    ix2=np.argsort(data2)
    wei1=wei1[ix1]
    wei2=wei2[ix2]
    data1=data1[ix1]
    data2=data2[ix2]
    d=0.
    fn1=0.
    fn2=0.
    j1=0
    j2=0
    j1w=0.
    j2w=0.
    while(j1<len(data1))&(j2<len(data2)):
            d1=data1[j1]
            d2=data2[j2]
            w1=wei1[j1]
            w2=wei2[j2]
            if d1<=d2:
                    j1+=1
                    j1w+=w1
                    fn1=(j1w)/sum(wei1)
            if d2<=d1:
                    j2+=1
                    j2w+=w2
                    fn2=(j2w)/sum(wei2)
            if abs(fn2-fn1)>d:
                    d=abs(fn2-fn1)
    return d

where we just modify to our purpose the classical two-sample KS statistic as implemented in Press, Flannery, Teukolsky, Vetterling - Numerical Recipes in C - Cambridge University Press - 1992 - pag.626.

Our questions are:

  • is anybody aware of any other way to do it?
  • is there any library in python/R/* that performs it?
  • what about the test? Does it exist or should we use a reshuffling procedure in order to evaluate the statistic?

回答1:

Studying the scipy.stats.ks_2samp code we were able to find a more efficient python solution. We share below in case anyone is interested:

from __future__ import division  # (for python 2/3 support)

import numpy as np

def ks_w2(data1, data2, wei1, wei2):
    ix1 = np.argsort(data1)
    ix2 = np.argsort(data2)
    data1 = data1[ix1]
    data2 = data2[ix2]
    wei1 = wei1[ix1]
    wei2 = wei2[ix2]
    data = np.concatenate([data1, data2])
    cwei1 = np.hstack([0, np.cumsum(wei1)/sum(wei1)])
    cwei2 = np.hstack([0, np.cumsum(wei2)/sum(wei2)])
    cdf1we = cwei1[[np.searchsorted(data1, data, side='right')]]
    cdf2we = cwei2[[np.searchsorted(data2, data, side='right')]]
    return np.max(np.abs(cdf1we - cdf2we))

To evaluate the performance we performed the following test:

ds1 = random.rand(10000)
ds2 = random.randn(40000) + .2
we1 = random.rand(10000) + 1.
we2 = random.rand(40000) + 1.

ks_w2(ds1, ds2, we1, we2) took 11.7ms on our machine, while ks_w(ds1, ds2, we1, we2) took 1.43s