I am trying to solve a regression task. I found that three models work nicely for different subsets of the data: LassoLARS, SVR and Gradient Tree Boosting. I noticed that when I make predictions with all three models and build a table of the true output against the outputs of my three models, at least one of the models is always really close to the true output, though the other two can be relatively far away.
When I compute the minimal possible error (taking the prediction from the 'best' predictor for each test example), I get an error much smaller than the error of any model alone. So I thought about trying to combine the predictions from these three different models into some kind of ensemble. The question is, how do I do this properly? All three of my models are built and tuned using scikit-learn; does it provide some kind of method for packing models into an ensemble? The problem here is that I don't want to just average the predictions from all three models: I want to do this with weighting, where the weighting should be determined based on the properties of the specific example.
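For reference, a minimal sketch of how that per-example lower bound can be computed (here `preds` and `y_true` are hypothetical arrays standing in for my models' outputs and the true targets):

```python
import numpy as np

# preds: shape (n_samples, 3), one column per model; y_true: shape (n_samples,)
abs_errors = np.abs(preds - y_true[:, None])  # per-model absolute error for each example
oracle_mae = abs_errors.min(axis=1).mean()    # error if the 'best' model were picked per example
print(oracle_mae)
```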
Even if scikit-learn does not provide such functionality, it would be nice if someone knows how to properly address this task of figuring out the weighting of each model for each example in the data. I think it might be done by a separate regressor built on top of these three models, which would try to output optimal weights for each of the three models, but I am not sure whether this is the best way of doing it.
Late response, but I wanted to add one practical point for this sort of stacked regression approach (which I use frequently in my work).
You may want to choose an algorithm for the stacker which allows positive=True (for example, ElasticNet). I have found that, when you have one relatively stronger model, an unconstrained LinearRegression() stacker will often fit a larger positive coefficient to the stronger model and a negative coefficient to the weaker one.
Unless you actually believe that your weaker model has negative predictive power, this is not a helpful outcome. It is very similar to having high multicollinearity between the features of a regular regression model, and it causes all sorts of edge effects.
This comment applies most significantly to noisy data situations. If you're aiming for an R² of 0.9, 0.95 or 0.99, you'd probably want to just throw out whichever model is getting a negative weighting.
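A minimal sketch of what that looks like, assuming `base_preds` is a column matrix of out-of-sample predictions from the base models and `y` the targets (the `alpha` value here is an arbitrary choice for illustration):

```python
from sklearn.linear_model import ElasticNet

# Constraining coefficients to be non-negative prevents the stacker from
# assigning a negative weight to a weaker (but not anti-predictive) model.
stacker = ElasticNet(alpha=1e-3, positive=True)
stacker.fit(base_preds, y)
print(stacker.coef_)  # non-negative weights, one per base model
```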
This is a known, interesting (and often painful!) problem with hierarchical predictions. The problem with training a number of predictors on the train data, then training a higher-level predictor over them, again using the train data, has to do with the bias-variance decomposition.
Suppose you have two predictors, one essentially an overfitting version of the other. The former will appear, over the train set, to be better than the latter. The combining predictor will favor the former for no true reason, just because it cannot distinguish overfitting from genuinely high-quality prediction.
The known way of dealing with this is to prepare, for each row in the train data and for each of the predictors, a prediction for that row based on a model that was not fitted on it. For the overfitting version, for example, this won't produce a good result for the row, on average. The combining predictor will then be able to assess more fairly how to combine the lower-level predictors.
Shahar Azulay & I wrote a transformer stage for dealing with this:
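(Their original transformer code isn't reproduced here; the following is a minimal sketch of the same idea built on scikit-learn's cross_val_predict, which produces exactly these out-of-fold predictions.)

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.model_selection import cross_val_predict

class Stacker(BaseEstimator, TransformerMixin):
    """Wraps a regressor so its output can safely feed a higher-level predictor.

    fit_transform returns out-of-fold predictions (each row predicted by a model
    fitted on the other folds); transform uses a model fitted on all the data.
    """
    def __init__(self, pred, cv=5):
        self.pred = pred
        self.cv = cv

    def fit(self, x, y):
        self.pred_ = clone(self.pred).fit(x, y)
        return self

    def fit_transform(self, x, y):
        # Out-of-fold predictions for the train set, then refit on everything.
        oof = cross_val_predict(clone(self.pred), x, y, cv=self.cv)
        self.fit(x, y)
        return oof.reshape(-1, 1)

    def transform(self, x):
        return self.pred_.predict(x).reshape(-1, 1)
```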
Here is an example of the improvement for the setting described in @MaximHaytovich's answer.
First, some setup:
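(The original setup code isn't preserved; this is a minimal reconstruction under the assumptions stated below, with the noise level and seed chosen arbitrarily.)

```python
import numpy as np

rng = np.random.RandomState(0)
y = rng.randn(100)
x0 = y + 0.2 * rng.randn(100)   # noisy copy of y
x1 = y + 0.2 * rng.randn(100)   # another noisy copy of y
X = np.column_stack([x0, x1])

# First 80 rows for train, last 20 for test.
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]
```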
Note that `x0` and `x1` are just noisy versions of `y`. We'll use the first 80 rows for train, and the last 20 for test.

These are the two predictors: a higher-variance gradient booster, and a linear predictor:
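(Again a sketch; the original hyperparameters aren't shown, so `n_estimators` here is an assumption.)

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# A higher-variance model, prone to overfitting on only 80 rows...
gb = GradientBoostingRegressor(n_estimators=100)
# ...and a simple, well-specified linear model.
lr = LinearRegression()
```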
Here is the methodology suggested in the answer:
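(A sketch of that methodology: fit both base models on the train rows, then fit the combiner on their in-sample train predictions.)

```python
from sklearn.metrics import mean_squared_error

gb.fit(X_train, y_train)
lr.fit(X_train, y_train)

# The combiner is trained on *in-sample* predictions -- this is where the bias
# creeps in: the gradient booster looks nearly perfect on its own training rows.
train_preds = np.column_stack([gb.predict(X_train), lr.predict(X_train)])
combiner = LinearRegression().fit(train_preds, y_train)

test_preds = np.column_stack([gb.predict(X_test), lr.predict(X_test)])
print(mean_squared_error(y_test, combiner.predict(test_preds)))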
Now, using stacking:
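(A sketch using the Stacker transformer above, so the combiner only ever sees out-of-fold predictions.)

```python
gb_stack = Stacker(GradientBoostingRegressor(n_estimators=100))
lr_stack = Stacker(LinearRegression())

# fit_transform yields out-of-fold predictions, giving the combiner an honest
# estimate of each base model's generalization quality.
train_oof = np.column_stack([
    gb_stack.fit_transform(X_train, y_train).ravel(),
    lr_stack.fit_transform(X_train, y_train).ravel(),
])
combiner = LinearRegression().fit(train_oof, y_train)

test_preds = np.column_stack([
    gb_stack.transform(X_test).ravel(),
    lr_stack.transform(X_test).ravel(),
])
print(mean_squared_error(y_test, combiner.predict(test_preds)))
```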
The stacking prediction does better. It realizes that the gradient booster is not that great.
OK, after spending some time googling 'stacking' (as mentioned by @andreas earlier), I found out how I could do the weighting in Python, even with scikit-learn. Consider the below:
I train a set of my regression models (as mentioned: SVR, LassoLars and GradientBoostingRegressor). Then I run all of them on the training data (the same data that was used for training each of these three regressors). I get predictions for the examples with each of my algorithms and save these three results into a pandas dataframe with columns 'predictedSVR', 'predictedLASSO' and 'predictedGBR'. And I add a final column to this dataframe, which I call 'predicted', holding the real target value.
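A sketch of that table-building step (`svr`, `lasso` and `gbr` standing in for the three already-fitted models, and `X_train`/`y_train` for the training data, are placeholder names of mine):

```python
import pandas as pd

# Predictions of the three fitted base models on the training data.
df = pd.DataFrame({
    'predictedSVR': svr.predict(X_train),
    'predictedLASSO': lasso.predict(X_train),
    'predictedGBR': gbr.predict(X_train),
    'predicted': y_train,  # the true target value
})
```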
Then I just train a linear regression on this new dataframe:
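Roughly like this (a sketch matching the column names above):

```python
from sklearn.linear_model import LinearRegression

stacker = LinearRegression()
stacker.fit(df[['predictedSVR', 'predictedLASSO', 'predictedGBR']], df['predicted'])
```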
So when I want to make a prediction for a new example, I just run each of my three regressors separately and then do:
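something like the following (variable names are placeholders for the three base predictions):

```python
new_row = pd.DataFrame([[new_svr, new_lasso, new_gbr]],
                       columns=['predictedSVR', 'predictedLASSO', 'predictedGBR'])
final_prediction = stacker.predict(new_row)
```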
on the outputs of my three regressors, and get the result.
The problem here is that I am only finding optimal weights for the regressors 'on average': the weights will be the same for every example on which I try to make a prediction.
What you describe is called "stacking", which is not implemented in scikit-learn yet, but I think contributions would be welcome. An ensemble that just averages will be merged pretty soon: https://github.com/scikit-learn/scikit-learn/pull/4161