Remove emoticons in R using the tm package

Posted 2019-07-14 12:41

I'm using the tm package to clean up a Twitter Corpus. However, the package is unable to clean up emoticons.

Here is the code that reproduces the error:

July4th_clean <- tm_map(July4th_clean, content_transformer(tolower))
Error in FUN(content(x), ...) : invalid input 'RT ElleJohnson Love of country is encircling the globes ������������������ july4thweekend July4th FourthOfJuly IndependenceDay NotAvailableOnIn' in 'utf8towcs'

Can someone point me in the right direction to remove the emoticons using the tm package?

Thank you,

Luis

2 answers
我想做一个坏孩纸
#2 · 2019-07-14 13:02

You can try this function:

iconv(July4th_clean, "latin1", "ASCII", sub="")

Duplicate issue, see post
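
As a rough sketch of how the iconv call above behaves, here it is applied to a plain character vector in base R. The helper name strip_non_ascii and the sample tweet are made up for illustration; they are not from the original post.

```r
# Hypothetical helper (name is an assumption): drop every byte that
# cannot be represented in ASCII, using the same iconv arguments as above.
strip_non_ascii <- function(x) {
  iconv(x, "latin1", "ASCII", sub = "")
}

# Made-up sample tweet containing two emoji code points.
tweet <- "RT Love \U0001F1FA\U0001F1F8 July4th"

cleaned <- strip_non_ascii(tweet)

# With the non-ASCII bytes gone, tolower() has only ASCII input to handle.
print(tolower(cleaned))
```

With tm, a function like this would normally be wrapped the same way tolower is wrapped in the question, e.g. tm_map(July4th_clean, content_transformer(strip_non_ascii)), and run before the tolower step. That ordering is an assumption about the asker's pipeline, not something the post confirms.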

Ridiculous、
#3 · 2019-07-14 13:12

You can use gsub to get rid of all non-ASCII characters.

Texts = c("Let the stormy clouds chase, everyone from the place ☁  ♪ ♬",
    "See you soon brother ☮ ",
    "A boring old-fashioned message" ) 

gsub("[^\x01-\x7F]", "", Texts)
[1] "Let the stormy clouds chase, everyone from the place    "
[2] "See you soon brother  "                                  
[3] "A boring old-fashioned message"

Details: In a regular expression, [ ] defines a character class, and a class that starts with ^ matches every character except the ones listed. The class here lists characters 1 to 127 (\x01-\x7F), which is the standard ASCII range, so the pattern matches everything outside ASCII, and gsub replaces each match with the empty string.
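
A small sketch of the same negated class used two ways: with grepl to see which strings contain non-ASCII characters, and with gsub to remove them. The whitespace cleanup at the end is an addition for illustration, not part of the original answer, and the variable names are assumptions.

```r
# Sample strings; \u262e is the peace symbol, written as an escape
# so the code does not depend on the source file's encoding.
texts <- c("See you soon brother \u262e ",
           "A boring old-fashioned message")

# Negated class: matches any character outside the range \x01-\x7F.
has_non_ascii <- grepl("[^\x01-\x7F]", texts)

# Remove the matched characters.
no_symbols <- gsub("[^\x01-\x7F]", "", texts)

# Removal can leave stray spaces behind; collapse runs and trim the ends.
tidy <- trimws(gsub(" +", " ", no_symbols))

print(has_non_ascii)
print(tidy)
```

Whether to tidy the leftover spaces depends on the later steps; tm's stripWhitespace transformation is another option for that.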
