I have done a lot of research on this and still can't find a solution.
I have extracted data from a German Facebook group that looks like this:
from_ID from_name message created_time
12334543 Max Muster Dies war auch eine sehr sch<U+00F6>ne Bucht 2016-01-08T19:00:54+0000
I understand that <U+00F6> stands for the German umlaut ö. There are many other cases where such Unicode escapes replace German umlauts or other language-specific characters (regardless of the language).
Whether I want to do a sentiment analysis or just produce a word cloud, I sometimes run into problems with this. For sentiment analysis, the issue is that the training data does not contain these Unicode escapes, so the prediction/classification goes wrong. For other text-based procedures, cleaning steps such as stopword removal are a problem, because stopword lists are also "clean" and do not contain these codes.
Is there an easy way to get rid of this and make R display the corresponding character instead of the code?
I have tried a lot. My last resort would be a gsub routine, but my data frame contains more than 1 million comments, and gsub would be very painful because there are simply too many such codes (if we consider more languages than just German).
If I understood correctly, the kind of computer I am using also matters: it is a MacBook Pro.
Any help here is really really appreciated!!
Thank you a lot for your time and help!
It's a bit mystifying, but this will do it:
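A minimal sketch of one way to do it, assuming the stringi package is installed (df and message below are placeholder names for your data frame and its text column): rewrite each <U+xxxx> marker as a \uxxxx escape with a single gsub call, then let stri_unescape_unicode turn the escapes into real characters.

    library(stringi)

    x <- "Dies war auch eine sehr sch<U+00F6>ne Bucht"

    # Rewrite every "<U+xxxx>" marker (4 hex digits, which covers the umlauts
    # in question) as a "\uxxxx" escape sequence ...
    x_esc <- gsub("<U\\+([0-9A-Fa-f]{4})>", "\\\\u\\1", x)

    # ... then convert the escapes into the actual characters
    stri_unescape_unicode(x_esc)
    # [1] "Dies war auch eine sehr schöne Bucht"

    # Both calls are vectorised, so the whole column can be fixed at once
    # (df / message are assumed names taken from the sample data):
    df$message <- stri_unescape_unicode(
      gsub("<U\\+([0-9A-Fa-f]{4})>", "\\\\u\\1", df$message)
    )

Because gsub and stri_unescape_unicode are vectorised, this runs over the full column in one pass rather than looping over a million comments, and it works for any language's characters, since the hex code itself is captured rather than listing each umlaut or accent separately.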