How to implement walk-forward testing in sklearn?

Posted 2019-03-09 19:11

In sklearn, GridSearchCV can take a pipeline as its estimator and find the best parameters through cross-validation. However, the usual cross-validation splits the data like this: (image not shown)

To cross-validate time series data, the training and testing data are often split like this: (image not shown)

That is to say, the testing data should always come after the training data in time.

My thoughts are:

  1. Write my own version of the k-fold class and pass it to GridSearchCV, so I can still enjoy the convenience of the pipeline. The problem is that it seems difficult to make GridSearchCV use specified indices for the training and testing data.

  2. Write a new class, GridSearchWalkForwardTest, which is similar to GridSearchCV. I am studying the source code in grid_search.py and find it a little complicated.

Any suggestions are welcome.

2 answers
虎瘦雄心在
#2 · 2019-03-09 20:00

I think you could use TimeSeriesSplit, either instead of your own implementation or as a basis for implementing a CV method that works exactly as you describe.
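As a minimal sketch of that suggestion: TimeSeriesSplit can be passed directly as the cv argument of GridSearchCV, so the pipeline workflow from the question is kept. The synthetic data, the Ridge model, and the alpha grid below are assumptions chosen only for illustration, and the rows are assumed to be sorted by time already.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, time-ordered data (assumption: row i was observed before row i + 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

# Illustrative pipeline; any estimator or preprocessing steps would do.
pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])

# Each split trains on an initial segment and tests on the block right after it.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    assert train_idx.max() < test_idx.min()  # no test sample precedes training data

# The splitter is handed to GridSearchCV through the cv argument.
search = GridSearchCV(
    pipe,
    param_grid={"model__alpha": [0.1, 1.0, 10.0]},
    cv=tscv,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```

With the default settings the training window grows with every split, so this is the expanding-window variant of walk-forward testing.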

After digging around a bit, it seems someone added a max_train_size parameter to TimeSeriesSplit in this PR, which seems to do what you want.
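To illustrate what max_train_size changes, here is a small sketch on ten dummy samples. It caps the training window, so the window slides forward instead of growing. The sample count and the window size of 4 are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten dummy samples; only their order matters here.
X = np.arange(10).reshape(-1, 1)

# Cap the training window at 4 samples so it slides rather than expands.
tscv = TimeSeriesSplit(n_splits=3, max_train_size=4)

splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]
for train, test in splits:
    print("train:", train, "test:", test)
```

Every training window here has at most four samples and ends immediately before its test block, which matches the sliding-window picture of walk-forward testing.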

成全新的幸福
#3 · 2019-03-09 20:01

My opinion is that you should try to implement your own GridSearchWalkForwardTest. I once used GridSearch for the training and then implemented the same grid search myself, and I didn't get the same results, even though I should have.

What I did in the end was use my own function. You have more control over the training and test sets, and more control over the parameters you train.
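The answer does not show its function, so the following is only a hypothetical sketch of what a hand-written walk-forward search might look like. The names walk_forward_splits and walk_forward_search, the Ridge model, and the window sizes are all assumptions made for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error


def walk_forward_splits(n_samples, train_size, test_size):
    """Yield (train_idx, test_idx) for a fixed-size window that slides forward."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train_end = start + train_size
        yield np.arange(start, train_end), np.arange(train_end, train_end + test_size)
        start += test_size  # move the window forward by one test block


def walk_forward_search(estimator, alphas, X, y, train_size, test_size):
    """Return (best_alpha, scores), where scores maps alpha to mean test MSE."""
    scores = {}
    for alpha in alphas:
        fold_errors = []
        for train_idx, test_idx in walk_forward_splits(len(X), train_size, test_size):
            # Fit a fresh copy on the past window, score on the block after it.
            model = clone(estimator).set_params(alpha=alpha)
            model.fit(X[train_idx], y[train_idx])
            fold_errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
        scores[alpha] = float(np.mean(fold_errors))
    best_alpha = min(scores, key=scores.get)
    return best_alpha, scores


# Synthetic, time-ordered data (assumption: rows are sorted by time).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

alphas = [0.1, 1.0, 10.0]
best_alpha, scores = walk_forward_search(Ridge(), alphas, X, y, train_size=20, test_size=10)
print(best_alpha, scores)
```

The same index generator also addresses option 1 in the question: GridSearchCV accepts an iterable of (train, test) index arrays for cv, so list(walk_forward_splits(len(X), 20, 10)) can be passed as cv without writing a new search class.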
