Equivalent of predict_proba for DecisionTreeRegressor


Question:

scikit-learn's DecisionTreeClassifier supports predicting the probability of each class via its predict_proba() method. DecisionTreeRegressor has no such method:

AttributeError: 'DecisionTreeRegressor' object has no attribute 'predict_proba'

My understanding is that the underlying mechanics are quite similar between decision tree classifiers and regressors, with the main difference being that a regressor's prediction is the mean of the training targets that fall into a leaf. So I'd expect it to be possible to extract the probability of each of those target values.

Is there another way to simulate this, e.g. by processing the tree structure? The code behind DecisionTreeClassifier's predict_proba wasn't directly transferable.

Answer 1:

You can get that data out of the tree structure:

import sklearn
import numpy as np
import pandas as pd
import graphviz
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

# Generate a simple dataset
X, y = make_regression(n_features=2, n_informative=2, random_state=0)
clf = DecisionTreeRegressor(random_state=0, max_depth=2)
clf.fit(X, y)
# Visualize the tree
graphviz.Source(sklearn.tree.export_graphviz(clf)).view()

>>> clf.predict(X[:5])
array([184.005667,  53.017289, 184.005667, -20.603498, -97.414461])
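
Since a max_depth=2 tree has at most four leaves, the prediction vector can take at most four distinct values, which you can confirm directly (a quick check, not part of the original answer):

>>> np.unique(clf.predict(X))
array([-97.414461, -20.603498,  53.017289, 184.005667])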

If you call clf.apply(X) you get the id of the leaf node that each instance falls into:

array([6, 5, 6, 3, 2, 5, 5, 3, 6, ... 5, 5, 6, 3, 2, 2, 5, 2, 2], dtype=int64)

Merging it with the target variable:

df = pd.DataFrame(np.vstack([y, clf.apply(X)]), index=['y', 'node_id']).T

>>> df
    y           node_id
0   190.370562  6.0
1   13.339570   5.0
2   141.772669  6.0
3   -3.069627   3.0
4   -26.062465  2.0
5   54.922541   5.0
6   25.952881   5.0
       ...

Now, if you do a groupby on node_id followed by mean, you will get the same values as clf.predict(X):

>>> df.groupby('node_id').mean()
                 y
node_id     
2.0     -97.414461
3.0     -20.603498
5.0     53.017289
6.0     184.005667

These are the values stored in the leaves of our tree:

>>> clf.tree_.value[6]
array([[184.00566679]])
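
To list every leaf and its value in one go, you can use the fact that tree_.children_left is -1 for leaf nodes; a minimal sketch:

# Leaves are nodes with no children: children_left == -1 marks them.
leaf_ids = np.flatnonzero(clf.tree_.children_left == -1)
for node_id in leaf_ids:
    print(node_id, clf.tree_.value[node_id].ravel()[0])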

To get node ids for a new data set, you can call

clf.decision_path(X[:5]).toarray()

which shows you an array like this

array([[1, 0, 0, 0, 1, 0, 1],
       [1, 0, 0, 0, 1, 1, 0],
       [1, 0, 0, 0, 1, 0, 1],
       [1, 1, 0, 1, 0, 0, 0],
       [1, 1, 1, 0, 0, 0, 0]], dtype=int64)

where you need to take the last non-zero element in each row (i.e. the leaf node):

>>> pd.DataFrame(clf.decision_path(X[:5]).toarray()).apply(lambda x: np.flatnonzero(x).max(), axis=1)
0    6
1    5
2    6
3    3
4    2
dtype: int64
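
Note that clf.apply accepts new data as well and returns the leaf id for each sample directly, so the decision_path round-trip is only needed if you want the full path:

>>> clf.apply(X[:5])
array([6, 5, 6, 3, 2], dtype=int64)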

So if, instead of predicting the mean, you wanted to predict the median, you would do:

>>> pd.DataFrame(clf.decision_path(X[:5]).toarray()).apply(
...     lambda x: np.flatnonzero(x).max(), axis=1
... ).to_frame(name='node_id').join(df.groupby('node_id').median(), on='node_id')['y']
0    181.381106
1     54.053170
2    181.381106
3    -28.591188
4    -93.891889
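
The same pattern generalizes to any per-leaf statistic. For instance, a minimal sketch of per-leaf quantiles (the 10th and 90th percentiles here are arbitrary choices for illustration), which gives a rough prediction interval per leaf:

# 10th and 90th percentile of the training targets in each leaf.
df.groupby('node_id')['y'].quantile([0.1, 0.9]).unstack()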


Answer 2:

This function adapts code from hellpanderr's answer to provide probabilities of each outcome:

from sklearn.tree import DecisionTreeRegressor
import numpy as np
import pandas as pd

def decision_tree_regressor_predict_proba(X_train, y_train, X_test, **kwargs):
    """Trains DecisionTreeRegressor model and predicts probabilities of each y.

    Args:
        X_train: Training features.
        y_train: Training labels.
        X_test: New data to predict on.
        **kwargs: Other arguments passed to DecisionTreeRegressor.

    Returns:
        DataFrame with columns for record_id (row of X_test), y 
        (predicted value), and prob (of that y value).
        The sum of prob equals 1 for each record_id.
    """
    # Train model.
    m = DecisionTreeRegressor(**kwargs).fit(X_train, y_train)
    # Get y values corresponding to each node.
    node_ys = pd.DataFrame({'node_id': m.apply(X_train), 'y': y_train})
    # Calculate probability as 1 / number of y values per node.
    node_ys['prob'] = 1 / node_ys.groupby('node_id')['y'].transform('count')
    # Aggregate per node-y, in case of multiple training records with the same y.
    node_ys_dedup = node_ys.groupby(['node_id', 'y']).prob.sum().to_frame()\
        .reset_index()
    # Extract predicted leaf node for each new observation.
    leaf = pd.DataFrame(m.decision_path(X_test).toarray()).apply(
        lambda x: np.flatnonzero(x).max(), axis=1).to_frame(name='node_id')
    leaf['record_id'] = leaf.index
    # Merge with y values and drop node_id.
    return leaf.merge(node_ys_dedup, on='node_id').drop(
        'node_id', axis=1).sort_values(['record_id', 'y'])

Example:

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Works better with min_samples_leaf > 1.
res = decision_tree_regressor_predict_proba(X_train, y_train, X_test,
                                            random_state=0, min_samples_leaf=5)
res[res.record_id == 2]
#      record_id       y        prob
#   25         2    20.6    0.166667
#   26         2    22.3    0.166667
#   27         2    22.7    0.166667
#   28         2    23.8    0.333333
#   29         2    25.0    0.166667
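
As a sanity check on the docstring's claim that the probabilities sum to 1 for each record, you can aggregate the result (a quick check on the res frame from above):

# Each record's probabilities should sum to 1 (up to float rounding).
assert (res.groupby('record_id').prob.sum().round(6) == 1).all()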