Parallel jobs don't finish in scikit-learn's GridSearchCV

Posted: 2019-03-15 18:44

Question:

In the following script, I'm finding that the jobs launched by GridSearchCV seem to hang.

import json
import pandas as pd
import numpy as np
import unicodedata
import re
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import SGDClassifier
import sklearn.cross_validation as CV
from sklearn.grid_search import GridSearchCV
from nltk.stem import WordNetLemmatizer

# Seed for randomization. Set to some definite integer for debugging and set to None for production
seed = None


### Text processing functions ###

def normalize(string):#Remove diacritics and whatevs
    return "".join(ch.lower() for ch in unicodedata.normalize('NFD', string) if not unicodedata.combining(ch))

wnl = WordNetLemmatizer()
def tokenize(string):#Ignores special characters and punct
    return [wnl.lemmatize(token) for token in re.compile(r'\w\w+').findall(string)]

def ngrammer(tokens):#Gets all grams in each ingredient
    max_n = 2
    return [":".join(tokens[idx:idx+n]) for n in np.arange(1,1 + min(max_n,len(tokens))) for idx in range(len(tokens) + 1 - n)]

print("Importing training data...")
with open('/Users/josh/dev/kaggle/whats-cooking/data/train.json','rt') as file:
    recipes_train_json = json.load(file)

# Build the grams for the training data
print('\nBuilding n-grams from input data...')
for recipe in recipes_train_json:
    recipe['grams'] = [term for ingredient in recipe['ingredients'] for term in ngrammer(tokenize(normalize(ingredient)))]

# Build vocabulary from training data grams. 
vocabulary = list({gram for recipe in recipes_train_json for gram in recipe['grams']})

# Stuff everything into a dataframe. 
ids_index = pd.Index([recipe['id'] for recipe in recipes_train_json],name='id')
recipes_train = pd.DataFrame([{'cuisine': recipe['cuisine'], 'ingredients': " ".join(recipe['grams'])} for recipe in recipes_train_json],columns=['cuisine','ingredients'], index=ids_index)


# Extract data for fitting
fit_data = recipes_train['ingredients'].values
fit_target = recipes_train['cuisine'].values

# extracting numerical features from the ingredient text
feature_ext = Pipeline([('vect', CountVectorizer(vocabulary=vocabulary)),
                        ('tfidf', TfidfTransformer(use_idf=True)),
                        ('svd', TruncatedSVD(n_components=1000))
])
lsa_fit_data = feature_ext.fit_transform(fit_data)

# Build SGD Classifier
clf = SGDClassifier(random_state=seed)
# Hyperparameter grid for GridSearchCV.
parameters = {
    'alpha': np.logspace(-6,-2,5),
}

# Init GridSearchCV with k-fold CV object
cv = CV.KFold(lsa_fit_data.shape[0], n_folds=3, shuffle=True, random_state=seed)
gs_clf = GridSearchCV(
    estimator=clf,
    param_grid=parameters,
    n_jobs=-1,
    cv=cv,
    scoring='accuracy',
    verbose=2    
)
# Fit on training data
print("\nPerforming grid search over hyperparameters...")
gs_clf.fit(lsa_fit_data, fit_target)

The console output is:

Importing training data...

Building n-grams from input data...

Performing grid search over hyperparameters...
Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV] alpha=1e-06 .....................................................
[CV] alpha=1e-06 .....................................................
[CV] alpha=1e-06 .....................................................
[CV] alpha=1e-05 .....................................................
[CV] alpha=1e-05 .....................................................
[CV] alpha=1e-05 .....................................................
[CV] alpha=0.0001 ....................................................
[CV] alpha=0.0001 .................................................... 

And then it just hangs. If I set n_jobs=1 in GridSearchCV, then the script completes as expected with output:

Importing training data...

Building n-grams from input data...

Performing grid search over hyperparameters...
Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV] alpha=1e-06 .....................................................
[CV] ............................................ alpha=1e-06 -   6.5s
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    6.6s
[CV] alpha=1e-06 .....................................................
[CV] ............................................ alpha=1e-06 -   6.6s
[CV] alpha=1e-06 .....................................................
[CV] ............................................ alpha=1e-06 -   6.7s
[CV] alpha=1e-05 .....................................................
[CV] ............................................ alpha=1e-05 -   6.7s
[CV] alpha=1e-05 .....................................................
[CV] ............................................ alpha=1e-05 -   6.7s
[CV] alpha=1e-05 .....................................................
[CV] ............................................ alpha=1e-05 -   6.6s
[CV] alpha=0.0001 ....................................................
[CV] ........................................... alpha=0.0001 -   6.6s
[CV] alpha=0.0001 ....................................................
[CV] ........................................... alpha=0.0001 -   6.7s
[CV] alpha=0.0001 ....................................................
[CV] ........................................... alpha=0.0001 -   6.7s
[CV] alpha=0.001 .....................................................
[CV] ............................................ alpha=0.001 -   7.0s
[CV] alpha=0.001 .....................................................
[CV] ............................................ alpha=0.001 -   6.8s
[CV] alpha=0.001 .....................................................
[CV] ............................................ alpha=0.001 -   6.6s
[CV] alpha=0.01 ......................................................
[CV] ............................................. alpha=0.01 -   6.7s
[CV] alpha=0.01 ......................................................
[CV] ............................................. alpha=0.01 -   7.3s
[CV] alpha=0.01 ......................................................
[CV] ............................................. alpha=0.01 -   7.1s
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:  1.7min finished

The single-threaded execution finishes pretty quickly so I'm sure I'm giving the parallel job case enough time to do the calculation itself.

Environment specs: MacBook Pro (15-inch, Mid 2010), 2.4 GHz Intel Core i5, 8 GB 1067 MHz DDR3, OSX 10.10.5, python 3.4.3, ipython 3.2.0, numpy v1.9.3, scipy 0.16.0, scikit-learn v0.16.1 (python and packages all from anaconda distro)

Some additional comments:

I use n_jobs=-1 with GridSearchCV all the time on this machine without issue, so my platform does support the functionality. It usually has 4 jobs out at a time, as I've got 4 cores on this machine (2 physical, but 4 "virtual" cores due to hyperthreading). But unless I misunderstand the console output, in this case it has 8 jobs out without any returning. Watching CPU usage in Activity Monitor in real time, 4 jobs launch, work a bit, then finish (or die?), followed by 4 more that launch, work a bit, and then go completely idle but stick around.

At no point do I see significant memory pressure. The main process tops out at about 1 GB of real memory and the child processes at around 600 MB. By the time they hang, real memory usage is negligible.

The script works fine with multiple jobs if one removes the TruncatedSVD step from the feature extraction pipeline. Note, though, that this pipeline acts before the grid search and is not part of the GridSearchCV job(s).

This script is for the Kaggle competition What's Cooking?, so if you want to try running it on the same data I'm using, you can grab it from there. The data comes as a JSON array of objects. Each object represents a recipe and contains a list of text snippets, which are the ingredients (an illustrative record is sketched below). Since each sample is a collection of documents instead of a single document, I ended up having to write some of my own n-gramming and tokenization logic, since I couldn't figure out how to get the built-in transformers of scikit-learn to do exactly what I want. I doubt any of that matters, but just an FYI.
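For reference, here is a hypothetical record in the shape the script above expects from train.json (field names as in the Kaggle dataset; the values are made up for illustration):

# Illustrative layout of train.json: a JSON array of recipe objects,
# each with an id, a cuisine label, and a list of ingredient strings.
recipes_train_json = [
    {
        "id": 10259,
        "cuisine": "greek",
        "ingredients": ["romaine lettuce", "black olives", "feta cheese crumbles"],
    },
    # ... one object per recipe
]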

I usually run scripts within the iPython CLI with %run, but I get the same behavior running it from the OSX bash terminal with python (3.4.3) directly.

Answer 1:

This might be an issue with the multiprocessing backend used by GridSearchCV when n_jobs > 1. Rather than multiprocessing, you can try multithreading to see if it works:

from sklearn.externals.joblib import parallel_backend

clf = GridSearchCV(...)
with parallel_backend('threading'):
    clf.fit(x_train, y_train)
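Note: sklearn.externals.joblib was deprecated in later scikit-learn releases and eventually removed, so on a newer install the same backend switch can be made through joblib directly. A minimal sketch, assuming joblib is available as a standalone package:

import joblib

clf = GridSearchCV(...)  # same grid search as above
with joblib.parallel_backend('threading'):
    clf.fit(x_train, y_train)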

I was having the same issue with my estimator when using GridSearchCV with n_jobs > 1, and using this works great across n_jobs values.

PS: I am not sure whether "threading" has the same advantages as "multiprocessing" for all estimators. In theory, "threading" is not a great choice if your estimator is limited by the GIL, but if the estimator is Cython/NumPy based it can do better than "multiprocessing".

System tried on:

MAC OS: 10.12.6
Python: 3.6
numpy==1.13.3
pandas==0.21.0
scikit-learn==0.19.1


Answer 2:

I believe I had a similar issue, and the culprit was a sudden spike in memory usage. The process would try to allocate memory and immediately die because there was not enough available.

If you have access to a machine with much more memory available (say 128-256 GB), it is worth checking there with the same or a lower number of jobs (n_jobs=4). That is how I resolved it, anyway: I just moved my script to a bigger server.
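For what it's worth, a minimal sketch of that suggestion applied to the grid search in the question: cap the worker count and keep pre_dispatch (an existing GridSearchCV parameter) at or below its default, so fewer candidate fits are queued in memory at once. The exact values here are illustrative:

gs_clf = GridSearchCV(
    estimator=clf,
    param_grid=parameters,
    n_jobs=4,                  # fewer workers than n_jobs=-1
    pre_dispatch='2*n_jobs',   # limit queued fits; lower it if memory spikes
    cv=cv,
    scoring='accuracy',
    verbose=2
)
gs_clf.fit(lsa_fit_data, fit_target)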



Answer 3:

I was able to solve a similar problem by explicitly setting the random seed:

np.random.seed(0)

My problem was caused by running GridSearchCV multiple times, so this might not apply directly to your use case.
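In the context of the question's script, that would amount to something like the sketch below; the only requirement is that the call runs before the grid search is fitted:

import numpy as np

np.random.seed(0)  # fix NumPy's global RNG state before the (repeated) grid searches
gs_clf.fit(lsa_fit_data, fit_target)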