How to find the best degree of polynomials?

Posted 2020-05-19 10:05

Question:

I'm new to machine learning and currently stuck on this. First I used linear regression to fit the training set, but got a very large RMSE. Then I tried polynomial regression to reduce the bias.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
poly_reg = LinearRegression()
poly_reg.fit(X_poly, y)

poly_predict = poly_reg.predict(X_poly)
# Compare the predictions against the targets y, not the inputs X
poly_mse = mean_squared_error(y, poly_predict)
poly_rmse = np.sqrt(poly_mse)
poly_rmse

This gave a slightly better result than linear regression. I then set degree = 3/4/5 and the result kept getting better, but the model is probably overfitting as the degree increases.

The best polynomial degree should be the one that gives the lowest RMSE on a cross-validation set, but I have no idea how to find it. Should I use GridSearchCV, or some other method?

I would much appreciate it if you could help me with this.

Answer 1:

Next time, please provide your X/y data or a dummy equivalent; that makes it faster to give you a specific solution. For now I've created dummy data of the form y = X**4 + X**3 + X + 1.

There are many ways you can improve on this, but a quick way to find the best degree is to fit your training data for each degree and pick the degree with the best performance on held-out data (e.g., lowest RMSE).

You can also play with how you decide to hold out your train/test/validation data.

import numpy as np
import matplotlib.pyplot as plt 

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)
y = X**4 + X**3 + X + 1

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

rmses = []
degrees = np.arange(1, 10)
min_rmse, min_deg = 1e10, 0

for deg in degrees:

    # Train features
    poly_features = PolynomialFeatures(degree=deg, include_bias=False)
    x_poly_train = poly_features.fit_transform(x_train)

    # Linear regression
    poly_reg = LinearRegression()
    poly_reg.fit(x_poly_train, y_train)

    # Compare with test data (transform only: the features were already fitted on the training set)
    x_poly_test = poly_features.transform(x_test)
    poly_predict = poly_reg.predict(x_poly_test)
    poly_mse = mean_squared_error(y_test, poly_predict)
    poly_rmse = np.sqrt(poly_mse)
    rmses.append(poly_rmse)

    # Keep the degree with the lowest RMSE on the held-out test set
    if min_rmse > poly_rmse:
        min_rmse = poly_rmse
        min_deg = deg

# Plot and present results
print('Best degree {} with RMSE {}'.format(min_deg, min_rmse))

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(degrees, rmses)
ax.set_yscale('log')
ax.set_xlabel('Degree')
ax.set_ylabel('RMSE')

This will print something like the following (the exact RMSE varies with the random split):

Best degree 4 with RMSE 1.27689038706e-08
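
A single random split can give a noisy estimate, so one variation on the hold-out choice mentioned above is k-fold cross-validation. The following is a minimal sketch, not part of the original answer: the noisy cubic data, the 5 folds and the degree range 1 to 9 are all assumptions picked for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noisy cubic data, used only for illustration
rng = np.random.RandomState(0)
X = np.linspace(-3, 3, 120).reshape(-1, 1)
y = X.ravel()**3 - 2 * X.ravel() + rng.normal(scale=1.0, size=120)

# X is sorted, so shuffle before splitting into folds
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_rmse = {}
for deg in range(1, 10):
    # The pipeline refits PolynomialFeatures on the training part of each fold
    model = make_pipeline(PolynomialFeatures(degree=deg, include_bias=False),
                          LinearRegression())
    # scikit-learn maximizes scores, so the RMSE comes back negated
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring='neg_root_mean_squared_error')
    cv_rmse[deg] = -scores.mean()

best_deg = min(cv_rmse, key=cv_rmse.get)
print('Best degree {} with CV RMSE {:.3f}'.format(best_deg, cv_rmse[best_deg]))
```

Averaging over folds makes the chosen degree less dependent on which points happen to land in the test set.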

Alternatively, you could also build a new class that carries out Polynomial fitting, and pass that to GridSearchCV with a set of parameters.



Answer 2:

In my opinion, the best way to find an optimal curve-fitting degree, or more generally an optimal fitting model, is to use the GridSearchCV class from the scikit-learn library.

Here is an example of how to use it.

First, let us define a function to sample random data:

def make_data(N, err=1.0, rseed=1):

    rng = np.random.RandomState(rseed)
    X = rng.rand(N, 1) ** 2
    y = 1. / (X.ravel() + 0.3)
    if err > 0:
        y += err * rng.randn(N)
    return X, y

Build a pipeline:

def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))

Create the data and a vector (X_test) for testing and visualisation purposes:

X, y = make_data(200)
X_test = np.linspace(-0.1, 1.1, 200)[:, None]

Define the GridSearchCV parameters:

# Note: the 'normalize' option of LinearRegression was removed in scikit-learn 1.2,
# so it is no longer part of the grid.
param_grid = {'polynomialfeatures__degree': np.arange(20),
              'linearregression__fit_intercept': [True, False]}
grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)
grid.fit(X, y)

Get the best estimator found by the search:

model = grid.best_estimator_
model

Pipeline(memory=None,
     steps=[('polynomialfeatures', PolynomialFeatures(degree=4, include_bias=True, interaction_only=False)), ('linearregression', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))])

Fit the model with the X and y data and use the vector to predict the values:

y_test = model.fit(X, y).predict(X_test)

Visualize the result:

plt.scatter(X, y)
plt.plot(X_test.ravel(), y_test, 'r')

[Figure: the best-fit result]

The full code snippet:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV

def make_data(N, err=1.0, rseed=1):

    rng = np.random.RandomState(rseed)
    X = rng.rand(N, 1) ** 2
    y = 1. / (X.ravel() + 0.3)
    if err > 0:
        y += err * rng.randn(N)
    return X, y

def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))


X, y = make_data(200)
X_test = np.linspace(-0.1, 1.1, 200)[:, None]

# 'linearregression__normalize' is omitted: it was removed in scikit-learn 1.2
param_grid = {'polynomialfeatures__degree': np.arange(20),
              'linearregression__fit_intercept': [True, False]}
grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)
grid.fit(X, y)

model = grid.best_estimator_

y_test = model.fit(X, y).predict(X_test)

plt.scatter(X, y)
plt.plot(X_test.ravel(), y_test, 'r')
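
By default GridSearchCV ranks regressors by their R² score. To select on cross-validated RMSE, as the question asks, a scoring string can be passed. This is a sketch under assumptions rather than part of the original answer: the data is regenerated inline in the same way as make_data so the block runs on its own, and the degree range 1 to 10 is picked only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data built the same way as make_data(200) above
rng = np.random.RandomState(1)
X = rng.rand(200, 1) ** 2
y = 1. / (X.ravel() + 0.3) + rng.randn(200)

pipe = make_pipeline(PolynomialFeatures(), LinearRegression())
degrees = list(range(1, 11))  # assumed search range
param_grid = {'polynomialfeatures__degree': degrees}

# Scores are maximized, so the RMSE is reported with a negative sign
grid = GridSearchCV(pipe, param_grid, cv=7,
                    scoring='neg_root_mean_squared_error')
grid.fit(X, y)

# mean_test_score holds the negated RMSE averaged over the 7 folds
for deg, score in zip(degrees, grid.cv_results_['mean_test_score']):
    print('degree {}: CV RMSE {:.3f}'.format(deg, -score))

best_deg = grid.best_params_['polynomialfeatures__degree']
best_rmse = -grid.best_score_
print('Best degree {} with CV RMSE {:.3f}'.format(best_deg, best_rmse))
```

The per-degree table from cv_results_ is worth a look: when several degrees score almost the same, the lower one is usually the safer choice.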


Answer 3:

This is really where Bayesian model selection comes in. It gives you the most likely model given both model complexity and data fit. The quick answer is to use the BIC (Bayesian information criterion):

k = number of variables in the model
n = number of observations
sse = sum(residuals**2)
BIC = n*ln(sse/n) + k*ln(n) 

Compute the BIC (or AIC, etc.) for each candidate model; the model with the lowest value is the preferred one.
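
The formula above can be applied to polynomial degree selection roughly as follows. This is only a sketch: the noisy cubic data is hypothetical, and counting k as the number of fitted coefficients (degree + 1, including the intercept) is an assumption about how to read "number of variables".

```python
import numpy as np

# Hypothetical noisy cubic data, used only for illustration
rng = np.random.RandomState(0)
x = np.linspace(-3, 3, 120)
y = x**3 - 2 * x + rng.normal(scale=1.0, size=x.size)

n = x.size  # number of observations
bic = {}
for deg in range(1, 10):
    coefs = np.polyfit(x, y, deg)          # least-squares polynomial fit
    residuals = y - np.polyval(coefs, x)
    sse = np.sum(residuals**2)
    k = deg + 1                            # fitted coefficients, intercept included
    bic[deg] = n * np.log(sse / n) + k * np.log(n)

best_deg = min(bic, key=bic.get)
print('Lowest BIC at degree {}'.format(best_deg))
```

Unlike the cross-validation approaches, this uses all the data for fitting and penalizes complexity through the k*ln(n) term instead of a held-out set.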