I may not be using the appropriate language in the title. If this needs edited please feel free.
I want to take a string with "byte"
substitutions for unicode characters and convert them back to unicode. Let's say I have:
x <- "bi<df>chen Z<fc>rcher hello world <c6>"
I'd like to get back:
"bißchen Zürcher hello world Æ"
I know that if I could get it to this form it would print to the console as desired:
"bi\xdfchen Z\xfcrcher \xc6"
I tried:
gsub("<([[a-z0-9]+)>", "\\x\\1", x)
## [1] "bixdfchen Zxfcrcher xc6"
How about this:
x <- "bi<df>chen Z<fc>rcher hello world <c6>"
m <- gregexpr("<[0-9a-f]{2}>", x)
codes <- regmatches(x, m)
chars <- lapply(codes, function(x) {
rawToChar(as.raw(strtoi(paste0("0x", substr(x,2,3)))), multiple = TRUE)
})
regmatches(x, m) <- chars
x
# [1] "bi\xdfchen Z\xfcrcher hello world \xc6"
Encoding(x) <- "latin1"
x
# [1] "bißchen Zürcher hello world Æ"
Note that you can't make an escaped character by pasting a "\x" to the front of a number. That "\x" really isn't in the string at all. It's just how R chooses to represent it on screen. Here use use rawToChar()
to turn a number into the character we want.
I tested this on a Mac so I had to set the encoding to "latin1" to see the correct symbols in the console. Just using a single byte like that isn't proper UTF-8.
You can also use the gsubfn
library.
library(gsubfn)
f <- function(x) rawToChar(as.raw(as.integer(paste0("0x", x))), multiple=T)
gsubfn("<([0-9a-f]{2})>", f, "bi<df>chen Z<fc>rcher hello world <c6>")
## [1] "bißchen Zürcher hello world Æ"