Kolmogorov-Smirnov test in Spark (Python) not working?

I was running a normality test in Python with Spark MLlib and saw what I think is a bug.

Here is the setup: I have a dataset that is normalized to the range [-1, 1].

When I compute a histogram, I can clearly see that the data is NOT normal:

>>> prices_norm.histogram(10)

([-1.0, -0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
 [226, 269, 119, 95, 52, 26, 8, 2, 2, 5])

When I run the Kolmogorov-Smirnov test, I get the following results:

>>> from pyspark.mllib.stat import Statistics
>>> testResults = Statistics.kolmogorovSmirnovTest(prices_norm, "norm")
>>> print(testResults)

Kolmogorov-Smirnov test summary:
degrees of freedom = 0 
statistic = 0.46231145770077375 
pValue = 1.742039845709087E-11 
Very strong presumption against null hypothesis: Sample follows theoretical distribution.

The Kolmogorov-Smirnov test defines the null hypothesis (H0) as: the data follow a specified distribution (http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm).

In this case the p-value is very low, so we should reject the null hypothesis. This makes sense, as the data is clearly not normal.
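To make the reasoning above concrete, here is a minimal pure-Python sketch of a one-sample, two-sided KS statistic against a standard normal, plus the usual reject/keep decision. The helper names (normal_cdf, ks_statistic, reject_null), the 0.05 significance level, and the tiny sample are assumptions for illustration; this is not Spark's implementation and not my actual data.

```python
import math


def normal_cdf(x, mean=0.0, std=1.0):
    """CDF of a normal distribution, computed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ks_statistic(sample, cdf=normal_cdf):
    """Largest gap between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def reject_null(p_value, alpha=0.05):
    """True when H0 ("sample follows the distribution") should be rejected."""
    return p_value < alpha


if __name__ == "__main__":
    # Hypothetical values squeezed into [-1, 1] and skewed toward -1.
    sample = [-0.95, -0.9, -0.85, -0.7, -0.65, -0.5, -0.3, 0.1]
    print("D =", ks_statistic(sample))
    # p-value copied from the test summary above.
    print("reject H0:", reject_null(1.742039845709087e-11))
```

A large statistic together with a tiny p-value means the sample is far from the reference distribution, so H0 is rejected.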

So why, then, does it say:

Sample follows theoretical distribution

Isn't this wrong? Shouldn't it say that the sample does NOT follow the theoretical distribution? Am I missing something?

Tags: python pyspark apache-spark-mllib kolmogorov-smirnov

1 Answer

戒情不戒烟

#2 · 2019-07-01 21:36

This was driving me crazy, so I went to look at the source code directly:

git://git.apache.org/spark.git
spark/mllib/src/main/scala/org/apache/spark/mllib/stat/test/KolmogorovSmirnovTest.scala

The code is correct; the null hypothesis is defined as:

object NullHypothesis extends Enumeration {
  type NullHypothesis = Value
  val OneSampleTwoSided = Value("Sample follows theoretical distribution")
}

The message simply states the strength of the evidence against the null hypothesis and then restates the null hypothesis itself:

Very strong presumption against null hypothesis: Sample follows theoretical distribution.
                                                 ________________________________________
                                                                    H0

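Arguably the wording is confusing, since the text after the colon can be misread as the conclusion rather than as H0. But the output is indeed correct: the test is rejecting the hypothesis that the sample follows the theoretical distribution.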
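As a rough illustration of how such a message is put together, here is a small Python sketch that appends the null hypothesis to a strength-of-evidence prefix. It is not Spark's code: the function names and the 0.01 / 0.05 / 0.1 cut-offs are assumptions made for this example; only the "Very strong presumption" wording and the H0 string come from the output and source quoted above.

```python
NULL_HYPOTHESIS = "Sample follows theoretical distribution"


def explain(p_value, null_hypothesis=NULL_HYPOTHESIS):
    """Build a summary line of the form '<strength> null hypothesis: <H0>.'"""
    # Cut-offs below are assumed for illustration only.
    if p_value <= 0.01:
        strength = "Very strong presumption against"
    elif p_value <= 0.05:
        strength = "Strong presumption against"
    elif p_value <= 0.1:
        strength = "Low presumption against"
    else:
        strength = "No presumption against"
    # The text after the colon is H0 itself, not the conclusion.
    return "{} null hypothesis: {}.".format(strength, null_hypothesis)


def explain_plainly(p_value, alpha=0.05):
    """A less ambiguous phrasing of the same decision (hypothetical helper)."""
    if p_value < alpha:
        return "Reject H0: the sample does NOT appear to follow the distribution."
    return "Cannot reject H0: the sample is consistent with the distribution."


if __name__ == "__main__":
    p = 1.742039845709087e-11  # p-value from the question
    print(explain(p))
    print(explain_plainly(p))
```

Reading the first line as "evidence against: H0" rather than as a verdict gives the same conclusion as the second, plainer line.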
