Why does my output from preprocessing methods in sklearn.pipeline not align?

Posted 2019-09-26 06:16

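I am working through the book "Hands-On Machine Learning" and wrote some transformation pipelines to clean my data. I found that the output of the same pipeline changes depending on the size of the input DataFrame I pass in. Here is the code: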

from sklearn.base import BaseEstimator, TransformerMixin
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
      self.attribute_names =attribute_names
    def fit(self,X,y=None):
      return self
    def transform(self,X):
      return X[self.attribute_names].values

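from sklearn.pipeline import FeatureUnion, Pipeline
# Imputer lived in sklearn.preprocessing until it was removed in scikit-learn 0.22
from sklearn.preprocessing import Imputer, LabelBinarizer, StandardScaler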

class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, sparse_output=False):
      self.sparse_output = sparse_output
    def fit(self, X, y=None):
      return self
    def transform(self, X, y=None):
      enc = LabelBinarizer(sparse_output=self.sparse_output)
      return enc.fit_transform(X)

# housing, housing_num and CombinedAttributesAdder are defined earlier in the book's code
num_attribs = list(housing_num)
cat_attribs = ['ocean_proximity']

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scalar', StandardScaler())
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', CustomLabelBinarizer())
])

full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline)
])
housing_prepared = full_pipeline.fit_transform(housing)
data_prepared = full_pipeline.transform(housing.iloc[:5])
data_prepared1 = full_pipeline.transform(housing.iloc[:1000])
data_prepared2 = full_pipeline.transform(housing.iloc[:10000])
print(data_prepared.shape)
print(data_prepared1.shape)
print(data_prepared2.shape)

The three print statements output (5, 14), (1000, 15) and (10000, 16). Can anyone help me explain this?

Answer 1:

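That's because in your CustomLabelBinarizer you fit a new LabelBinarizer every time transform() is called. It therefore learns a different set of labels on each call, depending on which rows you pass in, and so produces a different number of columns on each run.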
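To see the effect in isolation, here is a minimal sketch. The label values are made up for illustration and stand in for the 'ocean_proximity' column; they are not the real housing data. It compares refitting a LabelBinarizer on each slice with fitting it once and only transforming afterwards.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Made-up stand-in for the 'ocean_proximity' column (assumption, not the real data).
labels = np.array(
    ["INLAND", "NEAR BAY", "NEAR OCEAN", "INLAND", "ISLAND", "<1H OCEAN"]
)

# Refitting on every slice: the column count follows the labels present in that slice.
refit_shapes = [
    LabelBinarizer().fit_transform(labels[:n]).shape for n in (4, 5, 6)
]

# Fitting once, then only transforming: the column count is fixed by the fitted classes.
enc = LabelBinarizer().fit(labels)
fixed_shapes = [enc.transform(labels[:n]).shape for n in (4, 5, 6)]

print(refit_shapes)  # [(4, 3), (5, 4), (6, 5)]
print(fixed_shapes)  # [(4, 5), (5, 5), (6, 5)]
```

The first list grows by one column each time a new label appears in the slice, which matches the 14, 15, 16 pattern in the question.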

Change it like this:

class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, sparse_output=False):
      self.sparse_output = sparse_output
    def fit(self, X, y=None):
      self.enc = LabelBinarizer(sparse_output=self.sparse_output)
      self.enc.fit(X)
      return self
    def transform(self, X, y=None):
      return self.enc.transform(X)

Now I get the correct shapes for your code:

(5, 14)
(1000, 14)
(10000, 14)
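Since the housing data and CombinedAttributesAdder are not shown in this thread, the following is only a self-contained sketch of the same check on a tiny made-up DataFrame. The toy values, the reduced pipeline (no imputer or attribute adder) and the np.ravel call that flattens the single selected column are assumptions added for this illustration, not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import LabelBinarizer, StandardScaler


# Mixin listed first, as scikit-learn's developer guide recommends.
class DataFrameSelector(TransformerMixin, BaseEstimator):
    """Select DataFrame columns and return them as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values


class CustomLabelBinarizer(TransformerMixin, BaseEstimator):
    """LabelBinarizer wrapper that learns its classes in fit() only."""

    def __init__(self, sparse_output=False):
        self.sparse_output = sparse_output

    def fit(self, X, y=None):
        self.enc = LabelBinarizer(sparse_output=self.sparse_output)
        self.enc.fit(np.ravel(X))  # np.ravel: flatten the (n, 1) column (sketch-only addition)
        return self

    def transform(self, X, y=None):
        return self.enc.transform(np.ravel(X))


# Hypothetical toy data standing in for the housing DataFrame.
toy = pd.DataFrame({
    "median_income": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "NEAR OCEAN",
                        "INLAND", "ISLAND", "<1H OCEAN"],
})

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", Pipeline([
        ("selector", DataFrameSelector(["median_income"])),
        ("std_scaler", StandardScaler()),
    ])),
    ("cat_pipeline", Pipeline([
        ("selector", DataFrameSelector(["ocean_proximity"])),
        ("label_binarizer", CustomLabelBinarizer()),
    ])),
])

# Fit once on all rows, then transform slices of different sizes.
full_pipeline.fit(toy)
shapes = [full_pipeline.transform(toy.iloc[:n]).shape for n in (3, 4, 6)]
print(shapes)  # [(3, 6), (4, 6), (6, 6)]
```

Every slice comes back with 6 columns (1 scaled numeric column plus 5 label columns), because the label set is learned during fit() and reused by transform().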

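Note: the same question has been asked here. I assume you are using the code from this link. If you are using any other site, the code there may be an older version of the code I linked. Try the link above for the updated, error-free version of the code.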



Source: Why my output from preprocessing methods in sklearn.pipeline does not align?