Why does the output of my preprocessing pipeline in sklearn vary with the size of the input?

Posted 2019-08-04 13:22

Question:

I am working through the book "Hands-On Machine Learning" and writing transformation pipelines to clean up my data. I find that the output of the same pipeline varies according to the size of the DataFrame I pass in. Here is the code:

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # Select the given columns and return them as a NumPy array
        return X[self.attribute_names].values

from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelBinarizer  # used inside CustomLabelBinarizer below

class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, sparse_output=False):
        self.sparse_output = sparse_output
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        enc = LabelBinarizer(sparse_output=self.sparse_output)
        return enc.fit_transform(X)

num_attribs = list(housing_num)
cat_attribs = ['ocean_proximity']

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer, StandardScaler  # Imputer moved to sklearn.impute in 0.20+

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),  # custom transformer defined earlier in the book
    ('std_scalar', StandardScaler())
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', CustomLabelBinarizer())
])

full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline)
])
housing_prepared = full_pipeline.fit_transform(housing)
data_prepared = full_pipeline.transform(housing.iloc[:5])
data_prepared1 = full_pipeline.transform(housing.iloc[:1000])
data_prepared2 = full_pipeline.transform(housing.iloc[:10000])
print(data_prepared.shape)
print(data_prepared1.shape)
print(data_prepared2.shape)

The output of these three prints is (5, 14), (1000, 15), and (10000, 16). Could anyone explain this?

Answer 1:

That's because in CustomLabelBinarizer you fit the LabelBinarizer on every call to transform(), so it learns a different set of labels each time; the number of output columns therefore depends on how many distinct categories happen to appear in the rows you pass in.
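The effect is easy to reproduce in isolation. A minimal sketch with made-up labels (not the housing data): LabelBinarizer's output width equals the number of distinct classes seen at fit time, so re-fitting inside transform() ties the width to whichever slice you pass:

```python
from sklearn.preprocessing import LabelBinarizer

enc = LabelBinarizer()
# 3 distinct labels seen at fit time -> 3 output columns
print(enc.fit_transform(['a', 'b', 'c']).shape)            # (3, 3)
# 5 distinct labels seen at fit time -> 5 output columns
print(enc.fit_transform(['a', 'b', 'c', 'd', 'e']).shape)  # (5, 5)
```

With the housing data, a 5-row slice of 'ocean_proximity' happens to contain fewer distinct categories than a 10000-row slice, hence the varying total width.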

Change that to this:

class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, sparse_output=False):
        self.sparse_output = sparse_output
    def fit(self, X, y=None):
        # Fit once, on the full training data, so the learned classes
        # (and hence the output width) stay fixed across transforms
        self.enc = LabelBinarizer(sparse_output=self.sparse_output)
        self.enc.fit(X)
        return self
    def transform(self, X, y=None):
        return self.enc.transform(X)

And now I get the correct shapes from your code:

(5, 14)
(1000, 14)
(10000, 14)

Note: the same issue has been asked here. I assume you are using the code from here; if you took it from another website, it's likely an older version of the code I linked. Try the code at the above link for an error-free, updated version.
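As an aside, on scikit-learn 0.20+ this preprocessing no longer needs custom selector/binarizer classes: OneHotEncoder accepts 2-D column input and ColumnTransformer replaces DataFrameSelector + FeatureUnion. A minimal sketch on a tiny stand-in frame (the column names mirror the housing set, the values are made up):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative stand-in for the housing DataFrame
df = pd.DataFrame({
    'total_rooms': [880.0, 7099.0, 1467.0],
    'median_income': [8.3252, 8.3014, 7.2574],
    'ocean_proximity': ['NEAR BAY', 'INLAND', 'NEAR BAY'],
})

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler()),
])

full_pipeline = ColumnTransformer([
    ('num', num_pipeline, ['total_rooms', 'median_income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['ocean_proximity']),
])

prepared = full_pipeline.fit_transform(df)
# The encoder is fitted exactly once, inside fit_transform, so later
# transform() calls keep a fixed column count regardless of slice size
print(full_pipeline.transform(df.iloc[:1]).shape)
```

handle_unknown='ignore' additionally makes transform() robust to categories that never appeared during fitting, which the LabelBinarizer-based approach would choke on.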