How are the “error bands” in Seaborn tsplot calculated?

Posted 2019-04-09 09:06

I'm trying to understand how the error bands are calculated in the tsplot. Examples of the error bands are shown here.

When I plot something simple like

sns.tsplot(np.array([[0,1,0,1,0,1,0,1], [1,0,1,0,1,0,1,0], [.5,.5,.5,.5,.5,.5,.5,.5]]))

I get a horizontal line at y=0.5 as expected. The top error band is also a horizontal line at around y=0.665 and the bottom error band is a horizontal line at around y=0.335. Can someone explain how these are derived?

2 Answers
爱情/是我丢掉的垃圾
Reply #2 · 2019-04-09 09:38

I'm not a statistician, but I read through the seaborn code in order to see exactly what's happening. There are three steps:

  1. Bootstrap resampling. Seaborn creates resampled versions of your data. Each of these is a 3x8 matrix like yours, but each row is randomly selected, with replacement, from the three rows of your input. For example, one might be:

    [[ 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5]
     [ 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5]
     [ 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5]]
    

    and another might be:

    [[ 1.   0.   1.   0.   1.   0.   1.   0. ]
     [ 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5]
     [ 0.   1.   0.   1.   0.   1.   0.   1. ]]
    

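    It creates n_boot of these (the tsplot signature defaults to n_boot=5000; seaborn's lower-level bootstrap helper defaults to 10000 when called on its own).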

  2. Central tendency estimation. Seaborn runs a function on each column of each resampled version of your data. Because you didn't specify this argument (estimator), it feeds the columns to a mean function (numpy.mean with axis=0). Many of the columns in your bootstrap iterations will have a mean of 0.5, because they will be things like [0, 0.5, 1], [0.5, 1, 0], [0.5, 0.5, 0.5], etc., but you will also get some [1, 1, 0] and even some [1, 1, 1], which give higher means (and, symmetrically, some [0, 0, 1] and [0, 0, 0], which give lower ones).

  3. Confidence interval determination. For each column, seaborn sorts the bootstrap estimates of the mean (one per resampled version of the data) from smallest to largest and picks the ones that mark the lower and upper bounds of the CI. By default it uses a 68% CI, i.e. the 16th and 84th percentiles: if you lined up 1000 mean estimates, it would pick roughly the 160th and the 840th (840 - 160 = 680, or 68% of 1000).
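The three steps above can be sketched in plain NumPy. This is only an illustration of the procedure as described, not seaborn's actual code: the function name bootstrap_band, the seed argument, and the n_boot value used here are assumptions made for the sketch.

```python
import numpy as np

def bootstrap_band(data, n_boot=10000, ci=68, estimator=np.mean, seed=0):
    """Percentile-bootstrap band over the rows (units) of a 2-D array.

    Hypothetical helper written for this explanation; not part of seaborn.
    """
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    n_units = data.shape[0]

    # Step 1: draw row indices with replacement, one set per bootstrap iteration.
    idx = rng.integers(0, n_units, size=(n_boot, n_units))
    resampled = data[idx]  # shape: (n_boot, n_units, n_timepoints)

    # Step 2: apply the estimator across the units of each resampled array.
    boot_estimates = estimator(resampled, axis=1)  # shape: (n_boot, n_timepoints)

    # Step 3: take the percentiles that bracket the central `ci` percent.
    low, high = np.percentile(boot_estimates, [50 - ci / 2, 50 + ci / 2], axis=0)
    return estimator(data, axis=0), low, high

data = np.array([[0, 1, 0, 1, 0, 1, 0, 1],
                 [1, 0, 1, 0, 1, 0, 1, 0],
                 [.5, .5, .5, .5, .5, .5, .5, .5]])

center, low, high = bootstrap_band(data)
print(center)  # central line
print(low)     # lower edge of the band
print(high)    # upper edge of the band
```

For the array in the question, the central line is 0.5 everywhere and the band edges should land at or very near 1/3 and 2/3, which matches the values read off the plot.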

A couple of notes:

  • There are actually only 3^3, or 27, possible resampled versions of your array, and if you use a function such as mean, where the order of the rows doesn't matter, there are only 10 distinct ones (the number of ways to choose 3 rows from 3 with replacement, ignoring order). So every bootstrap iteration will be identical to one of those 27 versions, or 10 versions in the unordered case. This means that running thousands of iterations is probably overkill in this case.

  • The means 0.3333... and 0.6666... that show up as your confidence bounds are the means of resampled columns such as [1,0,0] and [1,1,0] respectively (or, equally, [0.5,0.5,0] and [0.5,0.5,1]), or rearranged versions of those.
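Because the input is so small, the bootstrap distribution can be enumerated exactly instead of sampled. The sketch below counts the ordered and order-independent resamples and reads the 16th and 84th percentiles off the exact distribution for one column. The helper exact_percentile is a hypothetical name, and it uses a simple "first value whose cumulative probability reaches q" rule, which is an assumption and not necessarily the interpolation rule seaborn or NumPy applies.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations_with_replacement, product

# First column of the array in the question, as exact fractions.
column = [Fraction(0), Fraction(1), Fraction(1, 2)]

# Ordered resamples: each of the 3 slots picks one of the 3 rows.
ordered = list(product(range(3), repeat=3))
# Order-independent resamples: multisets of 3 row indices.
unordered = list(combinations_with_replacement(range(3), 3))
print(len(ordered), len(unordered))  # 27 10

# Exact distribution of the resampled column mean over all 27 cases.
dist = Counter(sum(column[i] for i in rows) / 3 for rows in ordered)

def exact_percentile(dist, q):
    """Smallest value whose cumulative probability reaches q (0 < q < 1)."""
    total = sum(dist.values())
    running = 0
    for value in sorted(dist):
        running += dist[value]
        if Fraction(running, total) >= q:
            return value

low = exact_percentile(dist, Fraction(16, 100))
high = exact_percentile(dist, Fraction(84, 100))
print(low, high)  # 1/3 2/3
```

Only 4 of the 27 resamples have a mean below 1/3 (about 15%), so the 16th percentile falls on 1/3, and by symmetry the 84th falls on 2/3.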

我欲成王,谁敢阻挡
Reply #3 · 2019-04-09 09:57

They show a bootstrap confidence interval, computed by resampling units (rows in the 2D array input form). By default it shows a 68 percent confidence interval, which is roughly equivalent to plus or minus one standard error, but this can be changed with the ci parameter.
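The link between a 68 percent interval and the standard error can be checked numerically. The sketch below uses synthetic, normally distributed data (an assumption made purely for illustration; it is not the data from the question) and compares the half-width of the 16th to 84th percentile bootstrap interval with the textbook standard error of the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=200)  # synthetic data, 200 units

# Standard error of the mean from the usual formula.
sem = sample.std(ddof=1) / np.sqrt(sample.size)

# Bootstrap the mean by resampling units with replacement.
idx = rng.integers(0, sample.size, size=(10000, sample.size))
boot_means = sample[idx].mean(axis=1)
low, high = np.percentile(boot_means, [16, 84])

# Half-width of the 68% bootstrap interval, to compare against the SEM.
half_width = (high - low) / 2
print(sem, half_width)
```

The two numbers should agree closely when there are many units and the bootstrap distribution is roughly normal. With only three units, as in the question, the bootstrap distribution is coarse and discrete, so the band does not match the formula-based standard error exactly.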
