Trying to compare two distributions

I found this code on the internet that compares a normal distribution to several Student's t distributions:

x <- seq(-4, 4, length=100)
hx <- dnorm(x)

degf <- c(1, 3, 8, 30)
colors <- c("red", "blue", "darkgreen", "gold", "black")
labels <- c("df=1", "df=3", "df=8", "df=30", "normal")

plot(x, hx, type="l", lty=2, xlab="x value",
  ylab="Density", main="Comparison of t Distributions")

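for (i in 1:4){
  lines(x, dt(x,degf[i]), lwd=2, col=colors[i])
}

# Use the labels and the fifth color (the dashed normal curve) in a legend
legend("topright", inset=.05, title="Distributions",
  labels, lwd=2, lty=c(1, 1, 1, 1, 2), col=colors)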

I would like to adapt this to compare my own data to a normal distribution. This is my data:

library(quantmod)
getSymbols("^NDX", src="yahoo", from='1997-6-01', to='2012-6-01')
daily <- allReturns(NDX)[, c('daily')]
dailySerieTemporel <- ts(data=daily)
ss <- na.omit(dailySerieTemporel)

The objective is to see whether my data is normal or not. Can someone help me out a bit with this? Thank you very much, I really appreciate it!

Answer 1:

If you only want to know whether your data are normally distributed, you can apply the Jarque-Bera test. The null hypothesis of this test is that your data are normally distributed. You can perform it with the jarque.bera.test function from the tseries package.

 library(tseries)
 jarque.bera.test(ss)

    Jarque Bera Test

data:  ss 
X-squared = 4100.781, df = 2, p-value < 2.2e-16

The result clearly shows that your data are not normally distributed: the null hypothesis is rejected even at the 1% level.

To see why your data are not normally distributed, you can take a look at the descriptive statistics:

 library(fBasics)
 basicStats(ss)
                     ss
nobs        3776.000000
NAs            0.000000
Minimum       -0.105195
Maximum        0.187713
1. Quartile   -0.009417
3. Quartile    0.010220
Mean           0.000462
Median         0.001224
Sum            1.745798
SE Mean        0.000336
LCL Mean      -0.000197
UCL Mean       0.001122
Variance       0.000427
Stdev          0.020671
Skewness       0.322820
Kurtosis       5.060026

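From the last two rows, you can see that ss has excess kurtosis (heavier tails than a normal distribution) and nonzero skewness. A normal distribution has zero skewness and zero excess kurtosis, and the Jarque-Bera test is built on exactly these two quantities.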
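To make that link concrete, here is a minimal sketch of the Jarque-Bera statistic computed by hand, using the textbook formula JB = n/6 * (S^2 + K^2/4), where S is the sample skewness and K is the excess kurtosis. The function name jb_stat is hypothetical, and the moments here are divided by n, which is an assumption about normalization; other packages may normalize differently, so small discrepancies with their output are possible.

```r
# Hand-rolled Jarque-Bera statistic (illustrative sketch, not the tseries code).
jb_stat <- function(x) {
  x <- as.numeric(x)
  n <- length(x)
  d <- x - mean(x)
  m2 <- mean(d^2)            # second central moment (divided by n)
  m3 <- mean(d^3)            # third central moment
  m4 <- mean(d^4)            # fourth central moment
  skew   <- m3 / m2^1.5      # sample skewness
  exkurt <- m4 / m2^2 - 3    # excess kurtosis (0 for a normal distribution)
  stat <- n / 6 * (skew^2 + exkurt^2 / 4)
  # Under the null, the statistic is asymptotically chi-squared with 2 df.
  p <- pchisq(stat, df = 2, lower.tail = FALSE)
  c(statistic = stat, p.value = p)
}

set.seed(1)
jb_stat(rnorm(1000))        # simulated normal data
jb_stat(rt(1000, df = 3))   # simulated heavy-tailed data
```

Calling jb_stat(ss) on the returns series should give a value close to the one reported by jarque.bera.test.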

If instead you want to compare the actual distribution of your data against a normally distributed random variable with the same mean and variance, you can proceed in two steps. First, estimate the empirical density of your data with a kernel estimator and plot it. Then generate a normal random sample with the same mean and standard deviation as your data and overlay its estimated density, like this:

 plot(density(ss, kernel='epanechnikov'))
 set.seed(125)
 lines(density(rnorm(length(ss), mean(ss), sd(ss)), kernel='epanechnikov'), col=2)

In the same way, you can overlay a curve from any other probability distribution.
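A variation on this idea, closer to the code in the question, is to overlay the exact normal density with dnorm instead of a kernel estimate of a simulated sample, which removes the sampling noise from the reference curve. The sketch below is only an illustration: ss_demo is a hypothetical, simulated heavy-tailed stand-in for ss so that the example runs without downloading anything.

```r
# Hypothetical stand-in for the returns series `ss` (simulated, heavy-tailed).
set.seed(125)
ss_demo <- 0.02 * rt(3776, df = 4)

# Kernel estimate of the empirical density.
d <- density(ss_demo, kernel = "epanechnikov")
plot(d, lwd = 2, main = "Empirical density vs. fitted normal")

# Exact normal density with the same mean and standard deviation as the data.
curve(dnorm(x, mean = mean(ss_demo), sd = sd(ss_demo)),
      add = TRUE, col = 2, lty = 2, lwd = 2)

legend("topright", legend = c("kernel estimate", "normal"),
       col = c(1, 2), lty = c(1, 2), lwd = 2)
```

Replacing ss_demo with ss applies the same comparison to the real data. Heavy tails typically show up as a sharper peak and more mass far from the mean than the normal curve allows.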

The tests suggested by @Alex Reynolds will help if you want to know which distribution your data might have been drawn from. If that is your goal, take a look at the goodness-of-fit tests covered in any statistics textbook. However, if you just want to know whether your variable is normally distributed, the Jarque-Bera test is good enough.

Answer 2:

Take a look at a Q-Q plot, or at the Shapiro-Wilk or Kolmogorov-Smirnov (K-S) tests, to see if your data are normally distributed.
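A minimal sketch of these three checks in base R follows. Here x_demo is a hypothetical simulated stand-in for ss. Two caveats: shapiro.test only accepts between 3 and 5000 observations, and the K-S p-value is only approximate when the mean and standard deviation are estimated from the same data.

```r
# Hypothetical stand-in for the returns series `ss` (simulated, heavy-tailed).
set.seed(42)
x_demo <- 0.02 * rt(3776, df = 4)

# Q-Q plot: points far from the reference line in the tails suggest non-normality.
qqnorm(x_demo)
qqline(x_demo, col = 2)

# Shapiro-Wilk test (null hypothesis: the data are normal).
sw <- shapiro.test(x_demo)

# Kolmogorov-Smirnov test against a normal with the sample mean and sd.
ks <- ks.test(x_demo, "pnorm", mean = mean(x_demo), sd = sd(x_demo))

sw$p.value
ks$p.value
```

With the real data, as.numeric(ss) can be passed in place of x_demo.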