Google Cloud ML-engine scikit-learn prediction pro

Posted 2020-05-27 06:31

Question:

Google Cloud ML-engine supports deploying scikit-learn Pipeline objects. For example, a text classification Pipeline could look like the following:

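from sklearn import naive_bayes
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
classifier = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', naive_bayes.MultinomialNB())])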

The classifier can be trained,

classifier.fit(train_x, train_y)

Then the classifier can be uploaded to Google Cloud Storage,

model = 'model.joblib'
joblib.dump(classifier, model)
model_remote_path = os.path.join('gs://', bucket_name, datetime.datetime.now().strftime('model_%Y%m%d_%H%M%S'), model)
subprocess.check_call(['gsutil', 'cp', model, model_remote_path], stderr=sys.stdout)

Then a Model and Version can be created, either through the Google Cloud Console, or programmatically, linking the 'model.joblib' file to the Version.
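
Creating the Model and Version programmatically might look roughly like the following. This is a sketch rather than a verified deployment script: it assumes `ml` is the client returned by `discovery.build('ml', 'v1')`, the helper names `build_version_body` and `create_model_and_version` are made up for illustration, and the runtime version `'1.15'` and Python version `'3.7'` are placeholder values that should be checked against the runtimes currently supported.

```python
def build_version_body(version_name, model_dir_uri,
                       runtime_version='1.15', python_version='3.7'):
    """Build the request body for a scikit-learn Version.

    model_dir_uri is the gs:// directory that contains model.joblib
    (the directory, not the file itself).
    """
    return {
        'name': version_name,
        'deploymentUri': model_dir_uri,
        'runtimeVersion': runtime_version,  # assumed value, check before use
        'framework': 'SCIKIT_LEARN',
        'pythonVersion': python_version,    # assumed value, check before use
    }


def create_model_and_version(ml, project_name, model_name, version_name,
                             model_dir_uri):
    """Create a Model, then a Version under it.

    Returns the response of the version creation call, which describes a
    long-running operation rather than a finished deployment.
    """
    project_path = 'projects/{}'.format(project_name)
    ml.projects().models().create(
        parent=project_path, body={'name': model_name}).execute()

    model_path = '{}/models/{}'.format(project_path, model_name)
    return ml.projects().models().versions().create(
        parent=model_path,
        body=build_version_body(version_name, model_dir_uri)).execute()
```

Because version creation is asynchronous, the Version may not be ready to serve predictions immediately after this call returns.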

This classifier can then be used to predict new data by calling the deployed model predict endpoint,

from googleapiclient import discovery

ml = discovery.build('ml', 'v1')
# Full resource name of the model, optionally narrowed to a specific version.
project_id = 'projects/{}/models/{}'.format(project_name, model_name)
if version_name is not None:
    project_id += '/versions/{}'.format(version_name)
request_dict = {'instances': ['Test data']}
ml_request = ml.projects().predict(name=project_id, body=request_dict).execute()
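
The dictionary returned by `execute()` is expected to carry either a `predictions` list or an `error` message. A small helper to unwrap it could look like this; the name `extract_predictions` is hypothetical and the sketch assumes that response shape.

```python
def extract_predictions(response):
    """Return the list of predictions, or raise if the service reported an error."""
    if 'error' in response:
        raise RuntimeError('Prediction failed: {}'.format(response['error']))
    return response['predictions']
```

It would be called as `predictions = extract_predictions(ml_request)`, giving one entry per item in `instances`.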

Google Cloud ML-engine calls the predict function of the classifier and returns the predicted class. However, I would like to return the confidence score as well. Normally this could be achieved by calling the predict_proba function of the classifier, but there doesn't seem to be an option to change which function is called. My question is: is it possible to return the confidence score for a scikit-learn classifier when using Google Cloud ML-engine? If not, do you have any recommendations for how else to achieve this?

Update: I've found a hacky solution. It involves overriding the predict function of the classifier with its own predict_proba function:

nb = naive_bayes.MultinomialNB()
nb.predict = nb.predict_proba
classifier = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', nb)])

Surprisingly, this works. If anyone knows of a neater solution, please let me know.
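
With predict swapped for predict_proba, each entry in the response should be a row of class probabilities rather than a single label. scikit-learn orders those columns the same way as `classifier.classes_`, but the endpoint does not return the labels, so they have to be kept on the client side. A minimal sketch of turning one row into a label and a confidence score, where the helper name `top_class_with_confidence` and the sample values are invented for illustration:

```python
def top_class_with_confidence(probabilities, class_labels):
    """Return (label, probability) for the most likely class of one instance."""
    if len(probabilities) != len(class_labels):
        raise ValueError('Expected one probability per class label')
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return class_labels[best_index], probabilities[best_index]


# Hypothetical values: the labels as they would appear in classifier.classes_,
# and one probability row as the endpoint might return it.
class_labels = ['ham', 'spam']
row = [0.2, 0.8]
label, confidence = top_class_with_confidence(row, class_labels)
```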

Update: Google have released a new feature (currently in beta) called Custom prediction routines. This allows you to define what code is run when a prediction request comes in. It adds more code to the solution, but it is certainly less hacky.
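
As far as I understand the custom prediction routine interface, the service expects a class with a `predict` method and a `from_path` class method. A minimal sketch of a predictor that returns probabilities is below. The class name `ProbabilityPredictor` is my own, and it assumes the model was saved as `model.joblib` in the deployed model directory, as in the upload step earlier.

```python
import os


class ProbabilityPredictor(object):
    """Returns class probabilities instead of predicted labels."""

    def __init__(self, model):
        self._model = model

    def predict(self, instances, **kwargs):
        # instances is the list sent under the 'instances' key of the request.
        probabilities = self._model.predict_proba(instances)
        # Convert a NumPy array to nested lists so the result can be
        # serialized to JSON.
        if hasattr(probabilities, 'tolist'):
            probabilities = probabilities.tolist()
        return probabilities

    @classmethod
    def from_path(cls, model_dir):
        # Imported here so the class can be exercised locally without
        # joblib installed.
        import joblib
        model = joblib.load(os.path.join(model_dir, 'model.joblib'))
        return cls(model)
```

The class would still need to be packaged and referenced when creating the Version. Those steps are not shown here.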

Answer 1:

The ML Engine API you are using only has the predict method, as you can see in the documentation, so it will only do the prediction (unless you force it to do something else with the hack you mentioned).

If you want to do something else with your trained model, you’ll have to load it and use it normally. If you want to use the model stored in Cloud Storage you can do something like:

from google.cloud import storage
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

bucket_name = "<BUCKET_NAME>"
gs_model = "path/to/model.joblib"  # object path inside your Cloud Storage bucket
local_model = "/path/to/model.joblib"  # destination path on your local machine

client = storage.Client()
bucket = client.get_bucket(bucket_name)
blob = bucket.blob(gs_model)
blob.download_to_filename(local_model)

model = joblib.load(local_model)
model.predict_proba(test_data)