how does sklearn do Linear regression when p >n?

2019-02-11 03:53发布

问题:

it's known that when the number of variables (p) is larger than the number of samples (n) the least square estimator is not defined.

In sklearn I receive this values:

In [30]: lm = LinearRegression().fit(xx,y_train)

In [31]: lm.coef_
Out[31]: 
array([[ 0.20092363, -0.14378298, -0.33504391, ..., -0.40695124,
         0.08619906, -0.08108713]])

In [32]: xx.shape
Out[32]: (1097, 3419)

Call [30] should return an error. How does sklearn work when p>n like in this case?

EDIT: It seems that the matrix is filled with some values

if n > m:
        # need to extend b matrix as it will be filled with
        # a larger solution matrix
        if len(b1.shape) == 2:
            b2 = np.zeros((n, nrhs), dtype=gelss.dtype)
            b2[:m,:] = b1
        else:
            b2 = np.zeros(n, dtype=gelss.dtype)
            b2[:m] = b1
        b1 = b2

回答1:

When the linear system is underdetermined, then the sklearn.linear_model.LinearRegression finds the minimum L2 norm solution, i.e.

argmin_w l2_norm(w) subject to Xw = y

This is always well defined and obtainable by applying the pseudoinverse of X to y, i.e.

w = np.linalg.pinv(X).dot(y)

The specific implementation of scipy.linalg.lstsq, which is used by LinearRegression uses get_lapack_funcs(('gelss',), ... which is precisely a solver that finds the minimum norm solution via singular value decomposition (provided by LAPACK).

Check out this example

import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(5, 10)
y = rng.randn(5)

from sklearn.linear_model import LinearRegression
lr = LinearRegression(fit_intercept=False)
coef1 = lr.fit(X, y).coef_
coef2 = np.linalg.pinv(X).dot(y)

print(coef1)
print(coef2)

And you will see that coef1 == coef2. (Note that fit_intercept=False is specified in the constructor of the sklearn estimator, because otherwise it would subtract the mean of each feature before fitting the model, yielding different coefficients)