Suppose I have a pandas dataframe with two columns
that resembles the following one:
   text                              label
0  This restaurant was amazing       Positive
1  The food was served cold          Negative
2  The waiter was a bit rude         Negative
3  I love the view from its balcony  Positive
I then apply TfidfVectorizer from sklearn to this dataset.
What is the most efficient way to find, for each class, the top n vocabulary terms ranked by TF-IDF score?
Obviously, my actual dataframe has many more rows than the 4 above.
The point of my post is to find code that works for any dataframe resembling the one above, whether it has 4 rows or 1M rows.
I think that my post is related quite a lot to the following posts:
In the following, you can find a piece of code I wrote more than three years ago for a similar purpose. I'm not sure whether this is the most efficient way of doing what you want to do, but as far as I remember, it worked for me.