Get actual field name from JPMML model's Input

Posted 2019-07-24 17:47

Question:

I have a scikit-learn model that I'm using in my Java app via JPMML. I'm trying to set the InputFields using the names of the columns that were used during training, but "inField.getName().getValue()" returns an obfuscated name of the form "x{#}". Is there any way I could map "x{#}" back to the original feature/attribute name?

Map<FieldName, FieldValue> arguments = new LinkedHashMap<>();
for (InputField inField : patternEvaluator.getInputFields()) {
    // 1 if the feature is active, 0 otherwise
    int value = activeFeatures.contains(inField.getName().getValue()) ? 1 : 0;
    FieldValue inputFieldValue = inField.prepare(value);
    arguments.put(inField.getName(), inputFieldValue);
}
Map<FieldName, ?> results = patternEvaluator.evaluate(arguments);

Here's how I'm generating the model:

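from sklearn2pmml import PMMLPipeline
from sklearn2pmml import sklearn2pmml
import os
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np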

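data = pd.read_csv('/pydata/training.csv')
# Features are all columns except the last; the label is the 'classname' column.
# .values replaces as_matrix(), which newer pandas releases no longer provide.
X = data[data.keys()[:-1]].values
y = data['classname'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

estimators = [("read", RandomForestClassifier(n_jobs=5, n_estimators=200, max_features='auto'))]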
pipe = PMMLPipeline(estimators)
pipe.fit(X_train,y_train)
pipe.active_fields = np.array(data.columns)
sklearn2pmml(pipe, "/pydata/model.pmml", with_repr = True)

Thanks

Answer 1:

Does the PMML document contain the actual field names at all? Open it in a text editor and check the values of the /PMML/DataDictionary/DataField/@name attributes.
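If opening the file by hand is inconvenient, the same check can be scripted. The following is only a minimal sketch using Python's standard library; the embedded PMML fragment, its namespace version, and the helper name data_field_names are made up for illustration, and a real model file would be read from disk instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down PMML fragment used only for illustration.
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <DataDictionary>
    <DataField name="classname" optype="categorical" dataType="string"/>
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="x2" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>"""


def data_field_names(pmml_text):
    """Return the name attribute of every DataDictionary/DataField element."""
    root = ET.fromstring(pmml_text)
    # PMML elements live in a versioned namespace; reuse whatever the root declares.
    prefix = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    path = f"{prefix}DataDictionary/{prefix}DataField"
    return [field.get("name") for field in root.findall(path)]


names = data_field_names(SAMPLE_PMML)
print(names)
```

If the list shows only placeholders such as x1 and x2 next to the target field, the original column names never made it into the PMML file.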

Your question indicates that the conversion from Scikit-Learn to PMML was incomplete: it did not include the names of the active fields (also known as input fields). In that case they are assumed to be x1, x2, ..., xn.
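As a stopgap until the PMML is regenerated with real names, the placeholders could be translated by position. This is a rough sketch that assumes the placeholders x1..xn follow the order of the training columns, which is an assumption and not something confirmed here; the helper name default_name_mapping and the column names are hypothetical.

```python
def default_name_mapping(feature_columns):
    """Map placeholder names x1..xn to training column names, by position."""
    return {f"x{i}": name for i, name in enumerate(feature_columns, start=1)}


# Hypothetical column names; in the question these would come from data.columns[:-1].
columns = ["age", "income", "has_account"]
mapping = default_name_mapping(columns)
print(mapping["x2"])
```

The resulting dictionary could be exported, for example as JSON, and loaded on the Java side to look up the original name for each inField.getName().getValue().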



Answer 2:

Your pipeline only includes the estimator, which is why the names are lost. You have to include all the preprocessing steps as well in order to get the names into the PMML.

Assuming you do not do any preprocessing at all, the following is probably what you need (parts of your code that this snippet relies on, such as the imports and data loading, are not repeated):

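from sklearn_pandas import DataFrameMapper

# Keep the features as a DataFrame so the column names reach the mapper.
feature_columns = data.columns[:-1]
X_train, X_test, y_train, y_test = train_test_split(
    data[feature_columns], data['classname'], test_size=0.3, random_state=0)

# A transformer of None passes each column through unchanged.
nones = [(d, None) for d in feature_columns]

mapper = DataFrameMapper(nones, df_out=True)

lm = PMMLPipeline([
    ("mapper", mapper),
    # The step must be an estimator instance, not the list of (name, estimator) tuples.
    ("estimator", RandomForestClassifier(n_jobs=5, n_estimators=200))
])

lm.fit(X_train, y_train)

sklearn2pmml(lm, "ScikitLearnNew.pmml", with_repr=True)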

If you do require some preprocessing of your data, you can use any other transformer (e.g. LabelBinarizer) instead of None. But the preprocessing has to happen inside the pipeline in order to be included in the PMML.