I have a multi-line JSON file with records that contain special characters encoded as hexadecimals. Here is an example of a single JSON record:
{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}
This record is supposed to be {"value":"ıarines Bintıç Ramuçlar"}, i.e. the '"' character is replaced with the corresponding hexadecimal escape \x22, and other special Unicode characters are replaced with one or two hexadecimal escapes (for instance, \xC3\xA7 encodes ç).
I need to convert such strings into regular Unicode strings in Scala, so that printing one produces {"value":"ıarines Bintıç Ramuçlar"} without the hexadecimal escapes.
In Python I can easily decode these records with a line of code:
>>> a = "{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"
>>> a.decode("utf-8")
u'{"value":"\u0131arines Bint\u0131\xe7 Ramu\xe7lar"}'
>>> print a.decode("utf-8")
{"value":"ıarines Bintıç Ramuçlar"}
But in Scala I can't find a way to decode it. I unsuccessfully tried to convert it like this:
scala> val a = """{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"""
scala> print(new String(a.getBytes(), "UTF-8"))
{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}
I also tried URLDecoder, as suggested in a solution to a similar problem (but with URLs):
scala> val a = """{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"""
scala> print(java.net.URLDecoder.decode(a.replace("\\x", "%"), "UTF-8"))
{"value":"ıarines Bintıç Ramuçlar"}
It produced the desired result for this example, but it does not seem safe for generic text fields, since it is designed to work with URLs and requires replacing every \x with % in the string.
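To make that concern concrete, here is a small sketch of two inputs where the URLDecoder workaround goes wrong. The object and method names (UrlDecoderPitfalls, viaUrlDecoder) are made up for illustration, and the sample records are hypothetical:

```scala
import java.net.URLDecoder
import scala.util.Try

object UrlDecoderPitfalls {
  // Hypothetical helper that mirrors the workaround: turn \x into % and URL-decode.
  def viaUrlDecoder(s: String): String =
    URLDecoder.decode(s.replace("\\x", "%"), "UTF-8")

  def main(args: Array[String]): Unit = {
    // A literal '+' in the data is URL syntax for a space, so it is silently changed.
    println(viaUrlDecoder("""{\x22value\x22:\x221+1\x22}""")) // prints {"value":"1 1"}

    // A literal '%' that is not followed by two hex digits makes decode throw.
    println(Try(viaUrlDecoder("""{\x22value\x22:\x22100%\x22}""")).isFailure) // prints true
  }
}
```

So any record that contains a plain + or % would need extra escaping before this workaround could be trusted.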
Does Scala have some better way to deal with this issue?
I am new to Scala and will be thankful for any help
UPDATE:
I have made a custom solution with javax.xml.bind.DatatypeConverter.parseHexBinary. It works for now, but it seems cumbersome and not at all elegant. I think there should be a simpler way to do this.
Here is the code:
import javax.xml.bind.DatatypeConverter
import scala.annotation.tailrec
import scala.util.matching.Regex

def decodeHexChars(string: String): String = {
  // Matches a leading \xHH escape and captures the hex digits and the rest of the string.
  val regexHex: Regex = """\A\\[xX]([0-9a-fA-F]{1,2})(.*)""".r

  // Decodes the buffered hex digits as UTF-8 bytes and prepends the characters
  // to the (reversed) accumulator.
  def purgeBuffer(buffer: String, acc: List[Char]): List[Char] = {
    if (buffer.isEmpty) acc
    else new String(DatatypeConverter.parseHexBinary(buffer), "UTF-8").reverse.toList ::: acc
  }

  @tailrec
  def traverse(s: String, acc: List[Char], buffer: String): String = s match {
    case "" =>
      val accUpdated = purgeBuffer(buffer, acc)
      // The accumulator is in reverse order, so rebuild the string back to front.
      accUpdated.foldRight("")((str, b) => b + str)
    case regexHex(chars, suffix) =>
      // Collect consecutive escapes so multi-byte sequences are decoded together.
      traverse(suffix, acc, buffer + chars)
    case _ =>
      val accUpdated = purgeBuffer(buffer, acc)
      traverse(s.tail, s.head :: accUpdated, "")
  }

  traverse(string, Nil, "")
}
Each \x?? encodes one byte: for example, \x22 encodes " and \x5C encodes \. But in UTF-8 some characters are encoded using multiple bytes, so you need to transform \xC4\xB1 into the ı symbol, and so on.
replaceAllIn is really nice, but it might eat your backslashes. So, if you don't use group references (like $1) in the replacement string, quoteReplacement is the recommended way to escape the \ and $ symbols.
P.S. Does anyone know the difference between java.util.regex.Matcher.quoteReplacement and scala.util.matching.Regex.quoteReplacement?
The problem is that this encoding is really specific to Python (I think). Something like this might work:
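A minimal sketch of the byte-level idea described above, not necessarily the code this answer had in mind. It assumes every escape has exactly two hex digits and that the resulting bytes are valid UTF-8; the names HexEscapes, escapeRun and decode are made up for illustration:

```scala
import java.nio.charset.StandardCharsets.UTF_8
import scala.util.matching.Regex

object HexEscapes {
  // A run of one or more \xHH escapes. Matching whole runs keeps the bytes of a
  // multi-byte UTF-8 character together (assumption: always two hex digits).
  private val escapeRun: Regex = """(?:\\[xX][0-9a-fA-F]{2})+""".r

  def decode(s: String): String =
    escapeRun.replaceAllIn(s, m => {
      val bytes = m.matched
        .grouped(4)                                            // each escape is 4 chars: \xHH
        .map(e => Integer.parseInt(e.substring(2), 16).toByte) // hex digits -> one byte
        .toArray
      // quoteReplacement stops decoded '\' and '$' from being read as replacement syntax.
      Regex.quoteReplacement(new String(bytes, UTF_8))
    })

  def main(args: Array[String]): Unit =
    println(decode("""{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"""))
    // prints {"value":"ıarines Bintıç Ramuçlar"}
}
```

Text outside the escapes is left untouched, so a literal + or % in the data passes through unchanged.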