“Stop words” list for English? [closed]

I'm generating some statistics for some English-language text and I would like to skip uninteresting words such as "a" and "the".

Where can I find some lists of these uninteresting words?
Is a list of these words the same as a list of the most frequently used words in English?

update: these are apparently called "stop words" and not "skip words".

Tags: language-agnostic indexing filtering stop-words nlp

6 Answers

虎瘦雄心在

Reply #2 · 2019-01-16 08:59

Typically these words are the ones that appear in documents with the highest frequency. Assume you have a global list of words with their counts:

{ Word Count }

If you order the words from the highest count to the lowest and plot them (count on the y axis, word on the x axis), the graph looks like an inverse log function. All of the stop words sit at the left, and the cutoff for the "stop words" is the point where the first derivative is largest in magnitude, that is, where the count drops most steeply.
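The cutoff idea above can be sketched in a few lines of Python. This is only an illustration: the function name `stop_words_by_steepest_drop` is made up here, and the "derivative" is approximated by the difference in count between neighbouring ranks.

```python
from collections import Counter

def stop_words_by_steepest_drop(tokens):
    """Return the words ranked before the largest drop in count (hypothetical helper)."""
    # [(word, count), ...] ordered from the highest count to the lowest.
    ranked = Counter(tokens).most_common()
    if len(ranked) < 2:
        return []
    # Discrete stand-in for the first derivative: drop between rank i and rank i + 1.
    drops = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cutoff = drops.index(max(drops))  # first position with the steepest drop
    return [word for word, _ in ranked[: cutoff + 1]]

tokens = ["the"] * 10 + ["a"] * 9 + ["cat"] * 2 + ["dog"]
print(stop_words_by_steepest_drop(tokens))
```

On real text the largest absolute drop often sits between the first two ranks, so a practical version would probably need smoothing, log counts, or a different cutoff rule.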

This approach is better than a dictionary approach:

It is a universal approach that is not bound to a language
It learns which words count as "stop words"
It produces better results for collections that are very similar, and produces unique word listings for items in the collections
The stop words can be recalculated later (the result can be cached, and a statistical check can detect that the stop words have changed since they were calculated)
It can also eliminate time-bound or informal words and names (such as slang, or a company name that appears as a header in a batch of documents)

The dictionary approach is better:

The lookup time is much faster
The results are precached
It's simple
Someone else came up with the stop words
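For comparison, the dictionary approach amounts to a set membership test. A minimal sketch, assuming a tiny hand-written list (`STOP_WORDS` here is a made-up sample, not any published list) and plain whitespace tokenization:

```python
# Hypothetical, deliberately tiny stop list; a real one would be loaded from a published list.
STOP_WORDS = frozenset({"a", "an", "and", "are", "in", "is", "of", "the", "to"})

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop words found in STOP_WORDS."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

print(remove_stop_words("The cat is in the hat"))
```

The set lookup is constant time on average, which is where the "much faster" lookup comes from.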


三岁会撩人

Reply #3 · 2019-01-16 09:02

Gather word-frequency statistics from a large text corpus. Ignore all words whose frequency exceeds some threshold.
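A rough sketch of that threshold filter, assuming the corpus is already tokenized. The function name `frequent_words` and the `max_share` default are illustrative choices, not recommended values:

```python
from collections import Counter

def frequent_words(tokens, max_share=0.05):
    """Return the words whose share of all tokens exceeds max_share (hypothetical helper)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word for word, n in counts.items() if n / total > max_share}

tokens = ["the"] * 6 + ["cat", "dog", "sat", "mat"]
stop_words = frequent_words(tokens, max_share=0.5)
print([t for t in tokens if t not in stop_words])
```

Using a share of the total rather than a raw count keeps the threshold comparable across corpora of different sizes.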


该账号已被封号

Reply #4 · 2019-01-16 09:03

The magic word to put into Google is "stop words". This turns up a reasonable-looking list.

MySQL also has a built-in list of stop words, but it is far too comprehensive for my taste. For example, at our university library we had problems because "third" in "third world" was considered a stop word.


\"骚年 ilove

Reply #5 · 2019-01-16 09:08

These are called stop words; check this sample.


Root（大扎）

Reply #6 · 2019-01-16 09:10

Depending on the subdomain of English you are working in, you may need or want to compile your own stop word list. Some generic stop words could be meaningful in a domain. For example, the word "are" could actually be an abbreviation or acronym in some domain. Conversely, depending on your application, you may want to ignore some domain-specific words that you would not ignore in general English. For example, if you are analyzing a corpus of hospital reports, you may wish to ignore words like 'history' and 'symptoms', as they would be found in every report and may not be useful (from a plain vanilla inverted index perspective).

Otherwise, the lists returned by Google should be fine. The Porter Stemmer uses this and the Lucene search engine implementation uses this.
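The domain-specific adjustment described above could look something like this. The helper `build_stop_list` and the sample word sets are invented for illustration, using the hospital-report example:

```python
def build_stop_list(generic, keep=(), extra=()):
    """Start from a generic list, drop words that matter in the domain, add domain noise words."""
    return (set(generic) - set(keep)) | set(extra)

# Hypothetical values: "are" is meaningful in this domain, "history" and "symptoms" are noise.
generic = {"a", "the", "are", "of"}
hospital_stop_words = build_stop_list(generic, keep={"are"}, extra={"history", "symptoms"})
print(sorted(hospital_stop_words))
```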


Melony?

Reply #7 · 2019-01-16 09:10

I think I used the stopword list for German from here when I built a search application with Lucene.NET a while ago. The site contains a list for English, too, and the lists on the site are apparently the ones that the Lucene project uses as default, too.
