convert utf8 code point strings like to utf8

2019-06-11 17:10发布

问题:

I have a text file which contains some kind of fallback conversions of Unicode characters (the Unicode code points in angle brackets). So it contains e.g. foo<U+017E>bar which should be "foošbar". Is there an easy way in R to convert the whole file to UTF8 with these characters converted? Unfortunately I am on Windows and can't find a supported UTF-8 locale.

回答1:

Perhaps:

library(stringi)
library(magrittr)

"foo<U+0161>bar and cra<U+017E>y" %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>% 
  stri_unescape_unicode() %>% 
  stri_enc_toutf8()
## [1] "foošbar and cražy"

may work (I don't need the last conversion on macOS but you may on Windows).



回答2:

The previous answer should work when the code point is presented with exactly four digits. Here is a modified version that should work for any number of digits between 1 and 8.

library(stringi)
library(magrittr)

"foo<U+0161>bar and cra<U+017E>y, Phoenician letter alf <U+10900>" %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]{4})>", "\\\\u$1") %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]{5})>", "\\\\U000$1") %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]{6})>", "\\\\U00$1") %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]{7})>", "\\\\U0$1") %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]{8})>", "\\\\U$1") %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]{1})>", "\\\\u000$1") %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]{2})>", "\\\\u00$1") %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]{3})>", "\\\\u0$1") %>% 
  stri_unescape_unicode() %>% 
  stri_enc_toutf8()
## [1] "foošbar and cražy, Phoenician letter alf               
                            
标签: r utf-8