I have a text file which contains some kind of fallback conversions of Unicode characters (the Unicode code points in angle brackets). So it contains e.g. foo<U+017E>bar
which should be "foošbar". Is there an easy way in R to convert the whole file to UTF8 with these characters converted? Unfortunately I am on Windows and can't find a supported UTF-8 locale.
可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
回答1:
Perhaps:
library(stringi)
library(magrittr)
"foo<U+0161>bar and cra<U+017E>y" %>%
stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>%
stri_unescape_unicode() %>%
stri_enc_toutf8()
## [1] "foošbar and cražy"
may work (I don't need the last conversion on macOS but you may on Windows).
回答2:
The previous answer should work when the code point is presented with exactly four digits. Here is a modified version that should work for any number of digits between 1 and 8.
library(stringi)
library(magrittr)
"foo<U+0161>bar and cra<U+017E>y, Phoenician letter alf <U+10900>" %>%
stri_replace_all_regex("<U\\+([[:alnum:]]{4})>", "\\\\u$1") %>%
stri_replace_all_regex("<U\\+([[:alnum:]]{5})>", "\\\\U000$1") %>%
stri_replace_all_regex("<U\\+([[:alnum:]]{6})>", "\\\\U00$1") %>%
stri_replace_all_regex("<U\\+([[:alnum:]]{7})>", "\\\\U0$1") %>%
stri_replace_all_regex("<U\\+([[:alnum:]]{8})>", "\\\\U$1") %>%
stri_replace_all_regex("<U\\+([[:alnum:]]{1})>", "\\\\u000$1") %>%
stri_replace_all_regex("<U\\+([[:alnum:]]{2})>", "\\\\u00$1") %>%
stri_replace_all_regex("<U\\+([[:alnum:]]{3})>", "\\\\u0$1") %>%
stri_unescape_unicode() %>%
stri_enc_toutf8()
## [1] "foošbar and cražy, Phoenician letter alf