I'm trying to use the scikit-learn Randomized Logistic Regression feature selection method but I keep running into cases where it kills all the features while fitting, and returns:
ValueError: Found array with 0 feature(s) (shape=(777, 0)) while a minimum of 1 is required.
This is as expected, clearly, because I'm reducing the regularization parameter, C, to ridiculously low levels (note that this is the inverse of the mathematical regularization parameter lambda, i.e., C = 1/lambda, so the lower the C, the more extreme the regularization).
My problem is: how can I find, in advance, the lowest C I can choose, without manually testing multiple values and crossing out the ones that throw this error?
In my case (starting off with ~250 features), I know C = 0.5 is the lowest I can go. 0.1, 0.4, and even 0.49 throw an error as they pull my feature set down to 0 (and give the shape=(blah, 0) error I've pasted above).
On another note (and perhaps this should be a different question): the higher my C (that is, the lower my lambda, i.e. the weaker the regularization), the more time my machine takes to fit. Add in the fact that I usually run RLR in a pipeline, with a StandardScaler before it and an SVM or RF after, plus cross-validation, and the total time needed to run on my machine explodes.
Without code it's a little hard to pinpoint the problem; the reason is that I don't believe that error is related to your C value. But to answer that question you'll need GridSearchCV.
The example in its documentation is good enough to get you started:
>>> from sklearn import svm, grid_search, datasets
>>> iris = datasets.load_iris()
>>> parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
>>> svr = svm.SVC()
>>> clf = grid_search.GridSearchCV(svr, parameters)
>>> clf.fit(iris.data, iris.target)
...
GridSearchCV(cv=None, error_score=...,
       estimator=SVC(C=1.0, cache_size=..., class_weight=..., coef0=...,
                     decision_function_shape=None, degree=..., gamma=...,
                     kernel='rbf', max_iter=-1, probability=False,
                     random_state=None, shrinking=True, tol=...,
                     verbose=False),
       fit_params={}, iid=..., n_jobs=1,
       param_grid=..., pre_dispatch=..., refit=...,
       scoring=..., verbose=...)
You can always take it further by specifying the cross-validation strategy via the cv parameter. Also, don't forget to change n_jobs if your data is large; it helps a lot.
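For instance (continuing the example above; the cv and n_jobs values are just illustrative):
>>> clf = grid_search.GridSearchCV(svr, parameters, cv=5, n_jobs=-1)  # 5-fold CV, all cores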
As for the reason: I don't think it's the C value, but rather something about the way you're presenting the data to the regression. Again, without code it's hard to see clearly.
As mentioned in my comment to Leb's answer, the correct answer is that it depends on the data. There is no way (as of right now) for an sklearn.pipeline.Pipeline or sklearn.grid_search.GridSearchCV to handle this specific case. If the regularization parameter is tight enough that it culls all the features in the input dataset, there is nothing left to train on, and the downstream classifiers in the Pipeline will fail (obviously) when GridSearchCV is searching for optimal parameters.
The way I've dealt with this situation in my case is by understanding and exploring my data thoroughly before adding any form of feature selection into the Pipeline.
As an example, I take the feature selection transformer outside the Pipeline and manually fit it over a range of values, focusing especially on the extremes (very high regularization and very low regularization). This gives me an idea of when the feature selection transformer culls all the features and when it does no feature selection at all. I then add the feature selection transformer back into the Pipeline and throw that into GridSearchCV, making sure that the searched parameters for the feature selection transformer are comfortably within the two extremes I found earlier. That prevents GridSearchCV from hitting a zero-feature case and breaking down.
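A minimal sketch of that probing step (assuming a scikit-learn version that still ships RandomizedLogisticRegression; X, y, the C values, and the printed layout are illustrative):
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RandomizedLogisticRegression

# Probe the feature selector on its own, outside the Pipeline.
# X, y are the training data (not shown); scale first, as in the Pipeline.
X_scaled = StandardScaler().fit_transform(X)

for C in [0.01, 0.1, 0.5, 1, 5, 10, 100]:      # extremes on both ends
    rlr = RandomizedLogisticRegression(C=C)
    rlr.fit(X_scaled, y)
    n_kept = rlr.get_support().sum()            # number of features that survived
    print("C=%g -> %d of %d features kept" % (C, n_kept, X_scaled.shape[1]))

# Afterwards, restrict the grid searched over the selector's C (e.g. 'rlr__C')
# to values comfortably inside the range where n_kept is neither 0 nor the
# full feature count.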