In the following script, I'm finding that the jobs launched by GridSearchCV seem to hang.
import json
import pandas as pd
import numpy as np
import unicodedata
import re
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import SGDClassifier
import sklearn.cross_validation as CV
from sklearn.grid_search import GridSearchCV
from nltk.stem import WordNetLemmatizer
# Seed for randomization. Set to a definite integer for debugging; set to None for production
seed = None
### Text processing functions ###
def normalize(string):  # Remove diacritics and other combining marks, and lowercase
    return "".join(ch.lower() for ch in unicodedata.normalize('NFD', string) if not unicodedata.combining(ch))
wnl = WordNetLemmatizer()
def tokenize(string):  # Ignore special characters and punctuation
    return [wnl.lemmatize(token) for token in re.compile(r'\w\w+').findall(string)]
def ngrammer(tokens):  # Build all n-grams (up to bigrams) from an ingredient's tokens
    max_n = 2
    return [":".join(tokens[idx:idx+n]) for n in np.arange(1, 1 + min(max_n, len(tokens))) for idx in range(len(tokens) + 1 - n)]
print("Importing training data...")
with open('/Users/josh/dev/kaggle/whats-cooking/data/train.json', 'rt') as file:
    recipes_train_json = json.load(file)
# Build the grams for the training data
print('\nBuilding n-grams from input data...')
for recipe in recipes_train_json:
    recipe['grams'] = [term for ingredient in recipe['ingredients'] for term in ngrammer(tokenize(normalize(ingredient)))]
# Build vocabulary from training data grams.
vocabulary = list({gram for recipe in recipes_train_json for gram in recipe['grams']})
# Stuff everything into a dataframe.
ids_index = pd.Index([recipe['id'] for recipe in recipes_train_json],name='id')
recipes_train = pd.DataFrame([{'cuisine': recipe['cuisine'], 'ingredients': " ".join(recipe['grams'])} for recipe in recipes_train_json],columns=['cuisine','ingredients'], index=ids_index)
# Extract data for fitting
fit_data = recipes_train['ingredients'].values
fit_target = recipes_train['cuisine'].values
# Extract numerical features from the ingredient text
feature_ext = Pipeline([('vect', CountVectorizer(vocabulary=vocabulary)),
                        ('tfidf', TfidfTransformer(use_idf=True)),
                        ('svd', TruncatedSVD(n_components=1000))
                        ])
lsa_fit_data = feature_ext.fit_transform(fit_data)
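# (lsa_fit_data comes back as a dense numpy array with 1000 columns, one per SVD component)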
# Build SGD Classifier
clf = SGDClassifier(random_state=seed)
# Hyperparameter grid for GridSearchCV.
parameters = {
    'alpha': np.logspace(-6, -2, 5),
}
# Init GridSearchCV with k-fold CV object
cv = CV.KFold(lsa_fit_data.shape[0], n_folds=3, shuffle=True, random_state=seed)
gs_clf = GridSearchCV(
    estimator=clf,
    param_grid=parameters,
    n_jobs=-1,
    cv=cv,
    scoring='accuracy',
    verbose=2
)
# Fit on training data
print("\nPerforming grid search over hyperparameters...")
gs_clf.fit(lsa_fit_data, fit_target)
The console output is:
Importing training data...
Building n-grams from input data...
Performing grid search over hyperparameters...
Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV] alpha=1e-06 .....................................................
[CV] alpha=1e-06 .....................................................
[CV] alpha=1e-06 .....................................................
[CV] alpha=1e-05 .....................................................
[CV] alpha=1e-05 .....................................................
[CV] alpha=1e-05 .....................................................
[CV] alpha=0.0001 ....................................................
[CV] alpha=0.0001 ....................................................
And then it just hangs. If I set n_jobs=1 in GridSearchCV, then the script completes as expected with output:
Importing training data...
Building n-grams from input data...
Performing grid search over hyperparameters...
Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV] alpha=1e-06 .....................................................
[CV] ............................................ alpha=1e-06 - 6.5s
[Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 6.6s
[CV] alpha=1e-06 .....................................................
[CV] ............................................ alpha=1e-06 - 6.6s
[CV] alpha=1e-06 .....................................................
[CV] ............................................ alpha=1e-06 - 6.7s
[CV] alpha=1e-05 .....................................................
[CV] ............................................ alpha=1e-05 - 6.7s
[CV] alpha=1e-05 .....................................................
[CV] ............................................ alpha=1e-05 - 6.7s
[CV] alpha=1e-05 .....................................................
[CV] ............................................ alpha=1e-05 - 6.6s
[CV] alpha=0.0001 ....................................................
[CV] ........................................... alpha=0.0001 - 6.6s
[CV] alpha=0.0001 ....................................................
[CV] ........................................... alpha=0.0001 - 6.7s
[CV] alpha=0.0001 ....................................................
[CV] ........................................... alpha=0.0001 - 6.7s
[CV] alpha=0.001 .....................................................
[CV] ............................................ alpha=0.001 - 7.0s
[CV] alpha=0.001 .....................................................
[CV] ............................................ alpha=0.001 - 6.8s
[CV] alpha=0.001 .....................................................
[CV] ............................................ alpha=0.001 - 6.6s
[CV] alpha=0.01 ......................................................
[CV] ............................................. alpha=0.01 - 6.7s
[CV] alpha=0.01 ......................................................
[CV] ............................................. alpha=0.01 - 7.3s
[CV] alpha=0.01 ......................................................
[CV] ............................................. alpha=0.01 - 7.1s
[Parallel(n_jobs=1)]: Done 15 out of 15 | elapsed: 1.7min finished
The single-threaded execution finishes pretty quickly, so I'm sure I'm giving the parallel case enough time to do the calculation itself.
Environment specs: MacBook Pro (15-inch, Mid 2010), 2.4 GHz Intel Core i5, 8 GB 1067 MHz DDR3, OS X 10.10.5, Python 3.4.3, IPython 3.2.0, numpy 1.9.3, scipy 0.16.0, scikit-learn 0.16.1 (Python and all packages from the Anaconda distribution).
Some additional comments:
I use n_jobs=-1 with GridSearchCV all the time on this machine without issue, so my platform does support the functionality. It usually has four jobs out at a time, since I've got four cores on this machine (two physical, but four "virtual" cores due to hyperthreading). But unless I misunderstand the console output, in this case it has eight jobs out without any returning. Watching CPU usage in Activity Monitor in real time, four jobs launch, work a bit, then finish (or die?), followed by four more that launch, work a bit, and then go completely idle but stick around.
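For reference, the four-core figure is the logical CPU count reported by the standard library, which, as far as I understand, is also what joblib uses to size the worker pool when n_jobs=-1:

import multiprocessing
print(multiprocessing.cpu_count())  # 4 here: 2 physical cores x 2 hyperthreads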
At no point do I see significant memory pressure. The main process tops out at about 1 GB of real memory, and the child processes at around 600 MB each. By the time they hang, real memory usage is negligible.
The script works fine with multiple jobs if one removes the TruncatedSVD step from the feature-extraction pipeline. Note, though, that this pipeline acts before the grid search and is not part of the GridSearchCV job(s).
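For concreteness, here's the variant of the feature-extraction pipeline that completes fine with multiple jobs; it's identical except that the TruncatedSVD step is dropped (so the features stay a sparse matrix rather than a dense LSA projection):

feature_ext = Pipeline([('vect', CountVectorizer(vocabulary=vocabulary)),
                        ('tfidf', TfidfTransformer(use_idf=True))
                        ])
lsa_fit_data = feature_ext.fit_transform(fit_data)  # sparse matrix; no SVD step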
This script is for the Kaggle competition What's Cooking?, so if you want to try running it on the same data I'm using, you can grab it from there. The data comes as a JSON array of objects; each object represents a recipe and contains a list of text snippets which are the ingredients. Since each sample is a collection of documents rather than a single document, I ended up having to write some of my own n-gramming and tokenization logic, since I couldn't figure out how to get scikit-learn's built-in transformers to do exactly what I want. I doubt any of that matters, but just an FYI.
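To give a sense of the shape of the data, a single record looks roughly like this (the id, cuisine, and ingredients fields are the ones the script uses; these particular values are made up for illustration):

{
    "id": 10259,
    "cuisine": "greek",
    "ingredients": ["romaine lettuce", "black olives", "feta cheese crumbles"]
}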
I usually run scripts within the IPython CLI with %run, but I get the same behavior running the script from the OS X bash terminal with python (3.4.3) directly.