Turn Unicode into Umlaut in R on Mac (Facebook Dat

Posted 2019-05-25 07:35

I have done a lot of research on this and still can't find a solution.

I have extracted data from a German Facebook group that looks like

from_ID         from_name           message                                        created_time
12334543        Max Muster          Dies war auch eine sehr sch<U+00F6>ne Bucht    2016-01-08T19:00:54+0000

I understand that <U+00F6> stands for the German umlaut ö. There are many other examples of such Unicode codes replacing German umlauts or other language-specific characters (in any language).

Whether I want to do a sentiment analysis or just produce a word cloud, this sometimes causes problems. For sentiment analysis, the training data does not contain these Unicode codes, so the prediction/classification goes wrong. For other text-based procedures, text cleaning such as stop word removal is a problem, because stop word lists are also "clean" and do not contain these codes.

Is there any easy way to get rid of this and to make R display the corresponding sign instead of the code?

I have tried a lot. My last resort would be a gsub routine. However, my data frame includes more than 1 million comments. In addition, gsub would be very painful, as there seem to be too many Unicode codes (if we think of more languages than German).

If I got it right it is also important what kind of computer I am using. It is a MacBook Pro.

Any help here is really really appreciated!!

Thank you a lot for your time and help!

1 answer
趁早两清
#2 · 2019-05-25 08:09

It's a bit mystifying, but this will do it:

message <- c("Dies war auch eine sehr sch<U+00F6>ne Bucht", 
             "Schlo<U+00DF> Sch<U+00F6>nbrunn.")

# convert the <U+xxxx> markers to R's \uxxxx escape syntax
message2 <- stringi::stri_replace_all_fixed(
  message, c("<U+", ">"), c("\\u", ""), vectorize_all = FALSE
)

# convert to native by parsing each string as an R string literal
# caveat: text containing single quotes or backslashes will fail to parse
# or come out altered
as.character(parse(text = shQuote(message2)))
## [1] "Dies war auch eine sehr schöne Bucht" "Schloß Schönbrunn." 
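
If running parse() over user-written text feels fragile, the same conversion can be sketched in base R by decoding each marker directly. This is only a sketch: the function name decode_u_markers is made up for illustration, and it assumes every marker has the form <U+ followed by 4 to 6 hex digits and >. Anything else in the text is left untouched.

```r
# Hypothetical helper (not from the original answer): replace every
# <U+XXXX> marker with the character it encodes, using base R only.
decode_u_markers <- function(x) {
  # Assumed marker shape: "<U+", 4 to 6 hex digits, ">"
  pattern <- "<U\\+([0-9A-Fa-f]{4,6})>"
  hits <- gregexpr(pattern, x, perl = TRUE)
  regmatches(x, hits) <- lapply(regmatches(x, hits), function(m) {
    # Keep only the hex digits of each matched marker
    hex <- sub(pattern, "\\1", m, perl = TRUE)
    # Turn each hex code point into its UTF-8 character
    vapply(hex, function(h) intToUtf8(strtoi(h, 16L)),
           character(1), USE.NAMES = FALSE)
  })
  x
}

message <- c("Dies war auch eine sehr sch<U+00F6>ne Bucht",
             "Schlo<U+00DF> Sch<U+00F6>nbrunn.")
decode_u_markers(message)
```

Because the function is vectorised over a character vector, it could be applied to a whole column in one call, for example df$message <- decode_u_markers(df$message), where df is a placeholder for your data frame. I have not timed it on a million rows.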