I'm using the tm package to clean up a Twitter corpus. However, the package is unable to clean up emoticons.
Here's a reproducible example:
July4th_clean <- tm_map(July4th_clean, content_transformer(tolower))
Error in FUN(content(x), ...) : invalid input 'RT ElleJohnson Love of country is encircling the globes ������������������ july4thweekend July4th FourthOfJuly IndependenceDay NotAvailableOnIn' in 'utf8towcs'
Can someone point me in the right direction to remove the emoticons using the tm package?
Thank you,
Luis
you can try this function
Duplicate issue, see post
You can use gsub to get rid of all non-ASCII characters.
Details: You can specify character classes in regexes with [ ]. When the class description starts with ^, it means everything except these characters. Here, I have specified everything except characters 1-127, i.e. everything except standard ASCII, and I have specified that they should be replaced with the empty string.
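The code for this answer did not survive in the thread, so here is a minimal sketch of what the description implies. The pattern is inferred from "everything except characters 1-127"; the helper name remove_non_ascii and the sample tweet are made up for illustration and are not from the original post.

```r
# Hypothetical helper (name is an assumption, not from the original answer).
remove_non_ascii <- function(x) {
  # [^\x01-\x7F] is a negated character class: it matches any character
  # whose code is NOT in the range 1-127, i.e. anything outside standard ASCII.
  # Each match is replaced with the empty string.
  gsub("[^\x01-\x7F]", "", x)
}

# Made-up sample text containing two emoji code points.
tweet <- "Love of country \U0001F1FA\U0001F1F8 July4th"

cleaned <- remove_non_ascii(tweet)
print(cleaned)

# With the non-ASCII characters gone, tolower() has only ASCII input to handle.
print(tolower(cleaned))
```

With tm, the same function could presumably be applied to the whole corpus before the tolower step, along the lines of tm_map(July4th_clean, content_transformer(remove_non_ascii)). Note that this drops every non-ASCII character, including accented letters, not just emoticons.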