Question:

I'm using the tm package to clean up a Twitter Corpus. However, the package is unable to clean up emoticons.

Here's the code that reproduces the error:

July4th_clean <- tm_map(July4th_clean, content_transformer(tolower))
Error in FUN(content(x), ...) : invalid input 'RT ElleJohnson Love of country is encircling the globes ������������������ july4thweekend July4th FourthOfJuly IndependenceDay NotAvailableOnIn' in 'utf8towcs'

Can someone point me in the right direction to remove the emoticons using the tm package?

Thank you,

Luis

Answer 1:

You can use gsub to get rid of all non-ASCII characters.

Texts = c("Let the stormy clouds chase, everyone from the place ☁  ♪ ♬",
    "See you soon brother ☮ ",
    "A boring old-fashioned message" ) 

gsub("[^\x01-\x7F]", "", Texts)
[1] "Let the stormy clouds chase, everyone from the place    "
[2] "See you soon brother  "                                  
[3] "A boring old-fashioned message"

Details: You can specify character classes in a regex with [ ]. When the class description starts with ^, it matches everything except the listed characters. Here, the class is everything except characters 1-127, i.e. everything except standard ASCII, and each match is replaced with the empty string.
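As a rough sketch of how this could be used before lowercasing, the gsub call can be wrapped in a small helper. The name strip_non_ascii and the sample tweets are made up for illustration; only base R is used, and the strings are assumed to be UTF-8.

```r
# Hypothetical helper (not part of tm): remove every character
# that falls outside the ASCII range 1-127.
strip_non_ascii <- function(x) {
  gsub("[^\x01-\x7F]", "", x)
}

# Made-up sample input; \u262E is the peace symbol used in the answer above.
tweets <- c("See you soon brother \u262E",
            "A boring old-fashioned message")

cleaned <- strip_non_ascii(tweets)

# With the non-ASCII characters gone, lowercasing works on plain ASCII text.
lowered <- tolower(cleaned)
print(lowered)
```

With tm, a helper like this could presumably be passed through content_transformer() in a tm_map() call before the tolower step, though that combination is not tested here.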

Answer 2:

You can try this function:

iconv(July4th_clean, "latin1", "ASCII", sub="")
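A minimal sketch of the same idea on a plain character vector follows. It assumes the text is UTF-8 encoded, so it passes "UTF-8" as the source encoding rather than the "latin1" used above; the helper name strip_with_iconv and the sample strings are hypothetical.

```r
# Hypothetical helper: convert to ASCII and drop anything that
# cannot be represented, by substituting the empty string.
# Assumes the input strings are UTF-8 encoded.
strip_with_iconv <- function(x) {
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")
}

# Made-up sample input; \u2601 is the cloud symbol.
samples <- c("Let the stormy clouds chase \u2601",
             "A boring old-fashioned message")

cleaned_iconv <- strip_with_iconv(samples)
print(cleaned_iconv)
```

Exact behaviour of iconv can vary by platform, so it is worth checking the result on your own data.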

Duplicate issue, see post

remove emoticons in R using tm package
