Implementation of sklearn.impute.IterativeImputer

Posted 2020-06-22 02:59

Question:

Consider the data below, which contains some NaN values:

    Column-1  Column-2  Column-3  Column-4  Column-5
0        NaN      15.0      63.0       8.0      40.0
1       60.0      51.0       NaN      54.0      31.0
2       15.0      17.0      55.0      80.0       NaN
3       54.0      43.0      70.0      16.0      73.0
4       94.0      31.0      94.0      29.0      53.0
5       99.0      52.0      77.0      91.0      58.0
6       84.0      19.0      36.0       NaN      97.0
7       41.0      91.0      62.0      67.0      68.0
8       44.0      38.0      27.0      53.0      37.0
9       58.0       NaN      63.0      57.0      28.0
10      66.0      68.0      89.0      36.0      47.0
11       7.0      81.0       5.0      99.0      16.0
12      43.0      55.0      64.0      88.0       NaN
13       8.0      90.0      91.0      44.0       4.0
14      29.0      52.0      94.0      71.0      47.0
15      22.0      21.0      68.0      61.0      38.0
16      76.0      36.0      70.0      99.0      50.0
17      38.0      31.0      66.0      79.0      99.0
18      94.0      22.0      92.0      39.0      58.0

I want to replace the NaN values in the data using sklearn.impute.IterativeImputer. A friend helped me with the code below:

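import numpy as np
import pandas as pd
# IterativeImputer is experimental and must be enabled explicitly before it can be imported
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imp = IterativeImputer(missing_values=np.nan, sample_posterior=False,
                       max_iter=10, tol=0.001,
                       n_nearest_features=4, initial_strategy='median')
imp.fit(data)
imputed_data = pd.DataFrame(data=imp.transform(data),
                            columns=['Column-1', 'Column-2', 'Column-3', 'Column-4', 'Column-5'],
                            dtype='int')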

The imputed_data is:

    Column-1  Column-2  Column-3  Column-4  Column-5
0         59        15        63         8        40
1         60        51        66        54        31
2         15        17        55        80        48
3         54        43        70        16        73
4         94        31        94        29        53
5         99        52        77        91        58
6         84        19        36        59        97
7         41        91        62        67        68
8         44        38        27        53        37
9         58        46        63        57        28
10        66        68        89        36        47
11         7        81         5        99        16
12        43        55        64        88        47
13         8        90        91        44         4
14        29        52        94        71        47
15        22        21        68        61        38
16        76        36        70        99        50
17        38        31        66        79        99
18        94        22        92        39        58

According to the IterativeImputer documentation, the default estimator is BayesianRidge(). But if I use another estimator, such as estimator=ExtraTreesRegressor(n_estimators=10, random_state=0) in the code below, fitting emits a warning. The code:

from sklearn.ensemble import ExtraTreesRegressor

imp = IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
                       missing_values=np.nan, sample_posterior=False,
                       max_iter=10, tol=0.001,
                       n_nearest_features=4, initial_strategy='median')
imp.fit(data)

The message:

C:\Users\...\sklearn\impute\_iterative.py:599: ConvergenceWarning: [IterativeImputer] Early stopping criterion not reached.
  " reached.", ConvergenceWarning)

My question: is this a correct approach, or should I do something to fix the warning message?
Thank you.

Answer 1:

The same issue is being discussed here:

https://github.com/scikit-learn/scikit-learn/issues/14338



Answer 2:

You are getting this warning because of the max_iter=10 and tol=0.001 values set for IterativeImputer().

The stopping criterion, max(abs(X_t - X_{t-1})) / max(abs(X[known_vals])) < tol, was not met within the 10 allowed iterations (max_iter=10).
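
To make the criterion concrete, here is a minimal NumPy sketch of the check as it is written above. The function name stopping_criterion_met and the small arrays are made up for illustration; this is not scikit-learn's internal code, only the same formula applied by hand.

```python
import numpy as np


def stopping_criterion_met(X_prev, X_curr, X_orig, tol=1e-3):
    """Return True when max|X_t - X_{t-1}| / max|X[known_vals]| < tol.

    X_prev and X_curr are two consecutive imputed matrices; X_orig is the
    original matrix, whose NaN entries mark the unknown values.
    """
    known = ~np.isnan(X_orig)                    # mask of originally observed cells
    change = np.max(np.abs(X_curr - X_prev))     # largest update in this round
    scale = np.max(np.abs(X_orig[known]))        # largest observed magnitude
    return bool(change / scale < tol)


# Illustrative toy data (hypothetical, not the question's table).
X_orig = np.array([[np.nan, 15.0], [60.0, 51.0], [15.0, np.nan]])
X_prev = np.array([[40.0, 15.0], [60.0, 51.0], [15.0, 30.0]])
X_curr = np.array([[40.03, 15.0], [60.0, 51.0], [15.0, 30.01]])

converged = stopping_criterion_met(X_prev, X_curr, X_orig)
print(converged)
```

Because the change is divided by the largest observed value, tol is a relative tolerance: with values up to about 100, as in the question, tol=0.001 requires every imputed value to move by less than roughly 0.1 between two consecutive rounds.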

Refer to the description of max_iter in the parameters section of sklearn.impute.IterativeImputer documentation.

One workaround to avoid this warning is to set the max_iter parameter to a higher value.
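
A rough way to try this is to fit with different max_iter values, record whether a ConvergenceWarning was raised, and read the fitted n_iter_ attribute. The sketch below assumes a synthetic 19x5 array as a stand-in for the question's data, and fit_and_report is a hypothetical helper name; whether the warning goes away depends on the data and the estimator.

```python
import warnings

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.exceptions import ConvergenceWarning
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic stand-in for the question's table (an assumption, not the real data).
rng = np.random.RandomState(0)
X = rng.uniform(0, 100, size=(19, 5))
X[[0, 1, 2, 6, 9], [0, 2, 4, 3, 1]] = np.nan  # knock out five cells


def fit_and_report(max_iter):
    """Fit the imputer; return (rounds run, whether ConvergenceWarning fired)."""
    imp = IterativeImputer(
        estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
        max_iter=max_iter, tol=0.001,
        n_nearest_features=4, initial_strategy='median',
        random_state=0,
    )
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure the warning is not filtered out
        imp.fit(X)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return imp.n_iter_, warned


for max_iter in (10, 50):
    n_iter, warned = fit_and_report(max_iter)
    print(f"max_iter={max_iter}: ran {n_iter} rounds, ConvergenceWarning={warned}")
```

If n_iter_ comes back smaller than max_iter, the stopping criterion was reached early. If the warning still appears at a larger max_iter, loosening tol is the other knob to consider.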



Answer 3:

Have you tried importing ExtraTreesRegressor first? It should work fine.

from sklearn.ensemble import ExtraTreesRegressor

Also check your scikit-learn version. It should be 0.21.1 or above.
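
A quick way to check the installed version from Python is sketched below. The helper parse_version and the constant MIN_VERSION are illustrative names, and the 0.21.1 threshold is simply the one stated in this answer.

```python
import re

import sklearn

MIN_VERSION = (0, 21, 1)  # threshold taken from the answer above


def parse_version(text):
    """Return (major, minor, patch) from a version string such as '0.21.1'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    # A missing patch component (e.g. '1.4.dev0') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


installed = parse_version(sklearn.__version__)
is_recent_enough = installed >= MIN_VERSION
print(sklearn.__version__, "OK" if is_recent_enough else "too old")
```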