Extract English words from a text in R

Posted 2019-01-26 00:36

I have a text and I need to extract all English words from it. For instance, I want a function that would analyse the vector

vector <- c("picture", "carpet", "lamp", "notaword", "anothernotaword")

and return only the English words from this vector, i.e. "picture", "carpet", "lamp".

I understand that the definition of "English word" depends on the dictionary, but I would be satisfied even with a basic one.

Tags: r text word

1 answer

一纸荒年 Trace。

#2 · 2019-01-26 01:10

You could use qdapDictionaries, a package I maintain (the parent package qdap does not need to be installed). If your data is more complex, you may need to preprocess it first, for example with tolower. The idea is to see where a known word list (see ?GradyAugmented) intersects with your words. Here are two very similar approaches; the first is likely slightly faster, depending on the data:

vector <- c("picture", "carpet", "lamp", "notaword", "anothernotaword")

library(qdapDictionaries)
vector[vector %in% GradyAugmented]

## [1] "picture" "carpet"  "lamp"

intersect(vector, GradyAugmented)

## [1] "picture" "carpet"  "lamp"
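For free text rather than a ready-made vector of words, the same lookup can be preceded by a lowercasing and tokenizing step. The following is only a sketch: the function name extract_english_words, the tokenizing regex (letters and apostrophes count as word characters), and the tiny toy_dictionary are assumptions made for illustration, not part of qdapDictionaries. With the package loaded, GradyAugmented could be passed as the dictionary instead.

```r
# Hypothetical helper: split free text into lowercase tokens and keep only
# those found in `dictionary` (a character vector of known words).
extract_english_words <- function(text, dictionary) {
  # Lowercase first so that "Lamp" matches "lamp"
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  # strsplit can yield empty strings when the text starts with a separator
  tokens <- tokens[nzchar(tokens)]
  # Same %in% lookup as above, with duplicates removed
  unique(tokens[tokens %in% dictionary])
}

# Tiny stand-in dictionary so the example runs without extra packages
# (assumption: replace with GradyAugmented for real use).
toy_dictionary <- c("the", "picture", "carpet", "lamp", "is", "on")

result <- extract_english_words(
  "The Picture, the LAMP; notaword on carpet!",
  toy_dictionary
)
print(result)
```

Wrapping the result in unique() is a design choice; drop it if you need every occurrence of each word in order.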

As for the error you get when installing qdap, it sounds like @Ben Bolker is correct: you need a newer version of data.table (I'd suggest the latest; use packageVersion("data.table") to check what you have). Not requiring a minimum version of data.table was an oversight on my part. I assumed setDT (a function in the data.table package) had always been available, but it appears to be missing from your version. To solve this particular problem, though, you don't need the parent qdap package, just qdapDictionaries.
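The version check could be scripted roughly as follows. This is a sketch only: the helper name is_outdated is hypothetical, and the threshold "1.9.6" is a placeholder chosen for illustration, not the actual minimum data.table version that qdap needs.

```r
# Hypothetical helper: TRUE if `pkg` is missing or older than `minimum`.
is_outdated <- function(pkg, minimum) {
  # A package that is not installed counts as outdated
  if (!requireNamespace(pkg, quietly = TRUE)) return(TRUE)
  utils::packageVersion(pkg) < package_version(minimum)
}

# "1.9.6" is a placeholder threshold, not a documented requirement.
outdated <- is_outdated("data.table", "1.9.6")
if (outdated) {
  message("data.table is missing or old; consider install.packages('data.table')")
}
```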
