How to get bag of words from textual data? [closed

I am working on prediction problem using a large textual dataset. I am implementing Bag of Words Model.

What should be the best way to get the bag of words? Right now, I have tf-idf of the various words and the number of words is too large to use it for further assignments. If I use tf-idf criteria, what should be the tf-idf threshold for getting bag of words? Or should I use some other algorithms. I am using python.

标签： python machine-learning text-processing

5条回答

▲ chillily

2楼-- · 2020-05-19 00:44

Bag of words could be defined as a matrix where each row represents a document and columns representing the individual token. One more thing, the sequential order of text is not maintained. Building a "Bag of Words" involves 3 steps

tokenizing
counting
normalizing

Limitations to keep in mind: 1. Cannot capture phrases or multi-word expressions 2. Sensitive to misspellings, possible to work around that using a spell corrector or character representation,

e.g.

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
data_corpus = ["John likes to watch movies. Mary likes movies too.", 
"John also likes to watch football games."]
X = vectorizer.fit_transform(data_corpus) 
print(X.toarray())
print(vectorizer.get_feature_names())

0人赞添加讨论(0) 举报

贪生不怕死

3楼-- · 2020-05-19 00:45

Bag-of-words model is a nice method for text representation to be applied in different machine learning tasks. But in the first step you need to clean up data from unnecessary data for example punctuation, html tags, stop-words,... For these tasks you may can easily exploit libraries like Beautiful Soup (to remove HTML Markups) or NLTK (to remove stop-words) in Python. After cleaning your data you need to create a vector features (numerical representation of data for machine learning) this is where Bag-Of-Words plays the role. scikit-learn has a module (feature_extraction module) which can help you create the bag-of-words features.

You may find all you need in detail in this tutorial also this one can be very helpful. I found both of them very useful.

0人赞添加讨论(0) 举报

乱世女痞

4楼-- · 2020-05-19 00:46

From a book "Machine learning python":

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array(['blablablatext'])
bag = count.fit_transform(docs)

0人赞添加讨论(0) 举报

\"骚年 ilove

5楼-- · 2020-05-19 00:55

As others already mentioned, using nltk would be your best option if you want something stable, and scalable. It's highly configurable.

However, it has the downside of having a quite steep learning curve, if you want to tweak the defaults.

I once encountered a situation where I wanted to have a bag of words. Problem was, it concerned articles about technologies with exotic names full of -, _, etc. Such as vue-router or _.js etc.

The default configuration of nltk's word_tokenize is to split vue-router into two separate vue and router words, for instance. I'm not even talking about _.js.

So for what it's worth, I ended up writing this little routine to get all the words tokenized into a list, based on my own punctuation criteria.

import re

punctuation_pattern = ' |\.$|\. |, |\/|\(|\)|\'|\"|\!|\?|\+'
text = "This article is talking about vue-router. And also _.js."
ltext = text.lower()
wtext = [w for w in re.split(punctuation_pattern, ltext) if w]

print(wtext)
# ['this', 'article', 'is', 'talking', 'about', 'vue-router', 'and', 'also', '_.js']

This routine can be easily combined with Patty3118 answer about collections.Counter, which could lead you to know which number of times _.js was mentioned in the article, for instance.

0人赞添加讨论(0) 举报

SAY GOODBYE

6楼-- · 2020-05-19 00:59

Using the collections.Counter class

>>> import collections, re
>>> texts = ['John likes to watch movies. Mary likes too.',
   'John also likes to watch football games.']
>>> bagsofwords = [ collections.Counter(re.findall(r'\w+', txt))
            for txt in texts]
>>> bagsofwords[0]
Counter({'likes': 2, 'watch': 1, 'Mary': 1, 'movies': 1, 'John': 1, 'to': 1, 'too': 1})
>>> bagsofwords[1]
Counter({'watch': 1, 'games': 1, 'to': 1, 'likes': 1, 'also': 1, 'John': 1, 'football': 1})
>>> sumbags = sum(bagsofwords, collections.Counter())
>>> sumbags
Counter({'likes': 3, 'watch': 2, 'John': 2, 'to': 2, 'games': 1, 'football': 1, 'Mary': 1, 'movies': 1, 'also': 1, 'too': 1})
>>>

0人赞添加讨论(0) 举报

How to get bag of words from textual data? [closed

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间