I am trying out orthogonal matching pursuit (OMP) as a regularized regression, as an alternative to lasso and elastic net. I have 40k data points and 40 features. Lasso selects 5 features, while OMP selects only 1.
What could be causing this? Am I using OMP the wrong way? Perhaps it is not meant to be used as a regression. Please let me know if you can think of anything else I may be doing wrong.
Orthogonal Matching Pursuit seems a bit broken, or at least very sensitive to input data, as implemented in scikit-learn.
Example:
import sklearn.linear_model
import sklearn.datasets
import numpy
# Noise-free synthetic data: 10 of the 40 features carry signal
X, y, w = sklearn.datasets.make_regression(n_samples=40000, n_features=40, n_informative=10, coef=True, random_state=0)
# Note: newer scikit-learn releases no longer accept normalize; drop that argument there
clf1 = sklearn.linear_model.LassoLarsCV(fit_intercept=True, normalize=False, max_n_alphas=1000000)
clf1.fit(X, y)
clf2 = sklearn.linear_model.OrthogonalMatchingPursuitCV(fit_intercept=True, normalize=False)
clf2.fit(X, y)
# this is 1e-10, LassoLars is basically exact on this data
print(numpy.linalg.norm(y - clf1.predict(X)))
# this is 7e+8, OMP is broken
print(numpy.linalg.norm(y - clf2.predict(X)))
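Before blaming the solver, it may be worth ruling out the sparsity setting. The sketch below is only a check under stated assumptions, not a diagnosis: it assumes a recent scikit-learn (no `normalize` argument), uses a smaller noise-free problem than the one above, and relies on my understanding that `OrthogonalMatchingPursuitCV` by default searches only up to 10% of `n_features` (at least 5) atoms unless `max_iter` is raised. It fits OMP once with exactly 10 atoms and once with cross-validation allowed to use all 40.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import OrthogonalMatchingPursuit, OrthogonalMatchingPursuitCV

# Smaller noise-free problem: 10 informative features out of 40.
X, y = make_regression(n_samples=2000, n_features=40, n_informative=10,
                       random_state=0)

# Ask for exactly as many atoms as there are informative features.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=10).fit(X, y)
rel_residual = np.linalg.norm(y - omp.predict(X)) / np.linalg.norm(y)
print("fixed 10 atoms, relative residual:", rel_residual)

# Let cross-validation search over up to all 40 atoms instead of the default cap.
omp_cv = OrthogonalMatchingPursuitCV(max_iter=X.shape[1]).fit(X, y)
print("atoms chosen by CV:", omp_cv.n_nonzero_coefs_)
print("nonzero coefficients:", np.count_nonzero(omp_cv.coef_))
```

If the fixed-size fit is accurate while the default CV fit is not, the cap on the number of atoms is the more likely culprit than the algorithm itself.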
Fun experiments:
There are a bunch of canned datasets in sklearn.datasets. Does OMP fail on all of them? Apparently, it works okay on the diabetes dataset...
Is there any combination of parameters to make_regression that would generate data that OMP works for? Still looking for that one... 100 x 100 and 100 x 10 fail in the same way.
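One difference between the two cases that might matter: as far as I know, the bundled diabetes features are scaled so that every column has unit norm, whereas make_regression produces unscaled Gaussian columns. Whether column scale explains the failure is only a hypothesis. The sketch below, assuming a recent scikit-learn, is one way to test it; `omp_relative_residual` is a hypothetical helper written for this experiment, not a library function.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import OrthogonalMatchingPursuit


def omp_relative_residual(X, y, n_atoms):
    """Fit OMP with a fixed number of atoms; return ||y - y_hat|| / ||y||."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_atoms)
    omp.fit(X, y)
    return np.linalg.norm(y - omp.predict(X)) / np.linalg.norm(y)


X, y = make_regression(n_samples=2000, n_features=40, n_informative=10,
                       random_state=0)

# Rescale every column to unit Euclidean norm. This does not change which
# features are informative, only the scale of their coefficients.
X_unit = X / np.linalg.norm(X, axis=0)

rel_raw = omp_relative_residual(X, y, 10)
rel_unit = omp_relative_residual(X_unit, y, 10)
print("raw columns:      ", rel_raw)
print("unit-norm columns:", rel_unit)
```

If the two numbers differ by orders of magnitude on your installation, column scaling is a strong suspect; if both are tiny, the problem lies elsewhere.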