What is the difference between TfidfVectorizer.fit

In Tfidf.fit_transform we are only using the parameters X and have not used y for fitting the data set. Is this right? We are generating the tfidf matrix for only parameters of the training set.We are not using ytrain in fitting the model. Then how do we make predictions for the test data set

标签： python scikit-learn nlp tfidfvectorizer

1条回答

倾城　Initia

2楼-- · 2019-08-04 10:54

https://datascience.stackexchange.com/a/12346/122 has a good explanation of why it's call fit(), transform() and fit_transform().

In gist,

fit(): Fit the vectorizer/model to the training data and save the vectorizer/model to a variable (returns sklearn.feature_extraction.text.TfidfVectorizer)
transform(): Use the variable output from fit() to transformer validation/test data (returns scipy.sparse.csr.csr_matrix)
fit_transform(): Sometimes you to directly transform the training data, so you use fit() + transform() together, thus fit_transform(). (returns scipy.sparse.csr.csr_matrix)

E.g.

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.sparse.csr import csr_matrix


# The *TfidfVectorizer* from sklearn expects list of strings as input.
sent0 = "The quick brown fox jumps over the lazy brown dog .".lower()
sent1 = "Mr brown jumps over the lazy fox .".lower()
sent2 = "Roses are red , the chocolates are brown .".lower()
sent3 = "The frank dog jumps through the red roses .".lower()

dataset = [sent0, sent1, sent2, sent3]

# Initialize the parameters of the vectorizer
vectorizer = TfidfVectorizer(input=dataset, analyzer='word', ngram_range=(1,1),
                     min_df = 0, stop_words=None)

[out]:

# Learns the vocabulary of vectorizer based on the initialized parameter.
>>> vectorizer =  vectorizer.fit(dataset)

# Apply the vectorizer to new sentence.
>>> vectorizer.transform(["The brown roses jumps through the chocholate dog ."])
<1x15 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>

# Output to array form.
>>> vectorizer.transform(["The brown roses jumps through the chocholate dog ."]).toarray()
array([[0.        , 0.31342551, 0.        , 0.38714286, 0.        ,
        0.        , 0.31342551, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.38714286, 0.51249178, 0.49104163]])

# When you don't need to save the vectorizer for re-using.
>>> vectorizer.fit_transform(dataset)
<4x15 sparse matrix of type '<class 'numpy.float64'>'
    with 28 stored elements in Compressed Sparse Row format>

>>> vectorizer.fit_transform(dataset).toarray()
array([[0.        , 0.49642852, 0.        , 0.30659399, 0.30659399,
        0.        , 0.24821426, 0.30659399, 0.        , 0.30659399,
        0.38887561, 0.        , 0.        , 0.40586285, 0.        ],
       [0.        , 0.32107915, 0.        , 0.        , 0.39659663,
        0.        , 0.32107915, 0.39659663, 0.50303254, 0.39659663,
        0.        , 0.        , 0.        , 0.26250325, 0.        ],
       [0.76012588, 0.24258925, 0.38006294, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.29964599, 0.29964599, 0.19833261, 0.        ],
       [0.        , 0.        , 0.        , 0.34049544, 0.        ,
        0.4318753 , 0.27566041, 0.        , 0.        , 0.        ,
        0.        , 0.34049544, 0.34049544, 0.45074089, 0.4318753 ]])


>>> type(vectorizer)
<class 'sklearn.feature_extraction.text.TfidfVectorizer'>

>>> type(vectorizer.fit_transform(dataset))
<class 'scipy.sparse.csr.csr_matrix'>

>>> type(vectorizer.transform(dataset))
<class 'scipy.sparse.csr.csr_matrix'>

0人赞添加讨论(0) 举报

What is the difference between TfidfVectorizer.fit

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间