I have a pandas dataframe with a column text that consists of news articles, given as:
text
article1
article2
article3
article4
I have calculated the tf-idf values for the articles as:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
matrix_1 = tfidf.fit_transform(df['text'])
My dataframe keeps getting updated from time to time. Let's say that after calculating tf-idf as matrix_1, my dataframe got updated with more articles, something like:
text
article1
article2
article3
article4
article5
article6
article7
I have millions of articles, so I want to store the tf-idf matrix of the previous articles and update it with the tf-idf scores of the new articles. Running the tf-idf code on all articles again and again would be memory-consuming. Is there any way I can do this?
I haven't tested this code but I feel that this should work.
import pandas as pd
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({'text': []})  # assumed to be filled and extended elsewhere
tfidf = None
matrix_1 = None
last_len = 0  # number of rows already vectorized

while True:
    new_texts = df['text'].iloc[last_len:]
    if len(new_texts) and matrix_1 is None:
        # When your dataframe is populated for the very first time
        tfidf = TfidfVectorizer()
        matrix_1 = tfidf.fit_transform(new_texts)
    elif len(new_texts):
        # When your dataframe is populated again and again:
        # reuse the earlier fitted model and append only the new rows.
        # The matrices are sparse, so use scipy.sparse.vstack, not np.vstack.
        matrix_1 = vstack([matrix_1, tfidf.transform(new_texts)])
        # Calling tfidf.fit_transform(new_texts) here instead would relearn the
        # vocabulary, so the column count would differ and stacking would fail,
        # which is why refitting every time doesn't make sense.
    last_len = len(df)
    # TO-DO: some break condition according to your case
#####
If the time between dataframe updates is long, you can use pickle on matrix_1 (and on the fitted tfidf, which you need to transform later articles) to store intermediate results.
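A minimal sketch of that pickling idea, assuming a toy corpus in place of df['text'] and a temporary file as the storage location (both are made up for illustration). The fitted vectorizer is saved next to the matrix because it is needed to transform the later articles into the same columns.

```python
import os
import pickle
import tempfile

from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the old and new rows of df['text']
old_articles = ["stocks fall on rate fears", "team wins the cup final"]
new_articles = ["rate cut lifts stocks"]

# First run: fit once and vectorize what is available
tfidf = TfidfVectorizer()
matrix_1 = tfidf.fit_transform(old_articles)

# Store the fitted vectorizer together with the matrix (hypothetical path)
path = os.path.join(tempfile.mkdtemp(), "tfidf_state.pkl")
with open(path, "wb") as f:
    pickle.dump({"tfidf": tfidf, "matrix": matrix_1}, f)

# Later run, possibly in another process: load the state back
with open(path, "rb") as f:
    state = pickle.load(f)

# Vectorize only the new articles with the stored model and append them
new_rows = state["tfidf"].transform(new_articles)
updated = vstack([state["matrix"], new_rows]).tocsr()
print(updated.shape)
```

Pickles written by one scikit-learn version are not guaranteed to load in another, so this is best kept to short-lived intermediate storage.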
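However, I feel that calling tfidf.fit_transform(df['text']) again and again on different inputs will not give you meaningful results, because each fit learns a new vocabulary and new IDF weights, so the resulting matrices are not comparable with each other. Or maybe I misunderstood. Cheers!!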
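To make the difference between transform and a fresh fit_transform concrete, here is a small sketch on two made-up batches (the sentences are placeholders, not real articles). It only compares the shapes and vocabularies that come out of each call.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical batches standing in for the old and the new articles
batch_1 = ["stocks fall on rate fears", "team wins the cup final"]
batch_2 = ["rate cut lifts stocks", "new phone launched today"]

tfidf = TfidfVectorizer()
m1 = tfidf.fit_transform(batch_1)

# transform reuses the fitted vocabulary: same columns as m1,
# and words that were not seen during fit are ignored
m2_same = tfidf.transform(batch_2)

# a fresh fit learns its own vocabulary: the columns mean something else
refit = TfidfVectorizer()
m2_refit = refit.fit_transform(batch_2)

print(m1.shape, m2_same.shape, m2_refit.shape)
print(sorted(set(refit.vocabulary_) - set(tfidf.vocabulary_)))
```

The trade-off is that transform keeps the old IDF weights and drops unseen words, so after many updates an occasional full refit on everything may still be needed.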