Scala - unescape Unicode String without Apache

Posted 2019-05-14 01:31

I have a String "b\u00f4lovar" and I was wondering whether it is possible to unescape it without using Commons Lang. The library works, but I'm running into a problem in some environments and would like to minimize its use (i.e., it works on my machine but not in production).

StringEscapeUtils.unescapeJava(variables.getOrElse("name", ""))

How can I unescape it without the Apache library?

Thanks in advance.

Tags: scala unicode
1 answer
放我归山
#2 · 2019-05-14 01:51

Only Unicode escapes

If you only want to unescape sequences of the form \uXXXX (four hex digits), then a single regex replacement is enough:

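def unescapeUnicode(str: String): String =
  // \\u+ accepts one or more 'u' characters before the four hex digits
  """\\u+([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar match {
      // '\' and '$' must be escaped inside a regex replacement string
      case '\\' => """\\"""
      case '$' => """\$"""
      case c => c.toString
    })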

And the result is

scala> unescapeUnicode("b\\u00f4lovar \\u30B7")
res1: String = bôlovar シ

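The characters $ and \ have to be handled separately because java.util.regex.Matcher.appendReplacement treats them specially in the replacement string: $ starts a group reference and \ escapes the next character. Without that handling, the replacement fails: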

def wrongUnescape(str: String): String =
  """\\u([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar.toString)

scala> wrongUnescape("\\u00" + Integer.toString('$', 16))
java.lang.IllegalArgumentException: Illegal group reference: group index is missing
  at java.util.regex.Matcher.appendReplacement(Matcher.java:819)
  ... 46 elided

scala> wrongUnescape("\\u00" + Integer.toString('\\', 16))
java.lang.IllegalArgumentException: character to be escaped is missing
   at java.util.regex.Matcher.appendReplacement(Matcher.java:809)
   ... 46 elided
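
An alternative to matching on the two special characters by hand is to let the JDK quote the replacement string. The following is only a minimal sketch of that idea: it uses java.util.regex.Matcher.quoteReplacement, which escapes $ and \ so that appendReplacement inserts them literally. The names QuoteReplacementSketch and unescapeUnicodeQuoted are hypothetical and chosen just for this illustration.

```scala
import java.util.regex.Matcher
import scala.util.matching.Regex

// Hypothetical wrapper object, used only for this illustration.
object QuoteReplacementSketch {
  // Same pattern as in unescapeUnicode: backslash, one or more 'u', four hex digits.
  private val UnicodeEscape: Regex = """\\u+([0-9a-fA-F]{4})""".r

  def unescapeUnicodeQuoted(str: String): String =
    UnicodeEscape.replaceAllIn(str, m => {
      val decoded = Integer.parseInt(m.group(1), 16).toChar.toString
      // quoteReplacement escapes '$' and '\' so they are inserted literally.
      Matcher.quoteReplacement(decoded)
    })
}
```

With this variant, inputs such as "\\u0024" and "\\u005c" should come back as a literal $ and \ instead of throwing, at the cost of one extra call per match.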

All escape characters

Unicode character escapes are a bit special: they are not part of string literals but of the program source itself. They are replaced with the corresponding characters in a separate phase, before the ordinary escapes in string literals are processed:

scala> Integer.toString('a', 16)
res2: String = 61

scala> val \u0061 = "foo"
a: String = foo

scala> // first \u005c is replaced with a backslash, and then \t is replaced with a tab.
scala> "\u005ct"
res3: String = "    " 

The Scala library provides the function StringContext.treatEscapes, which supports all the ordinary escapes from the language specification.
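
As a rough illustration of treatEscapes on its own, the sketch below wraps the call in Try, since an unrecognized escape (for example \q) makes it throw instead of returning the input unchanged. The names TreatEscapesSketch and safeTreatEscapes are hypothetical. Newer Scala versions may also report treatEscapes as deprecated in favour of StringContext.processEscapes.

```scala
import scala.util.Try

// Hypothetical wrapper object, used only for this illustration.
object TreatEscapesSketch {
  // Returns Some(unescaped) on success, or None when the input
  // contains an escape sequence that treatEscapes rejects.
  def safeTreatEscapes(str: String): Option[String] =
    Try(StringContext.treatEscapes(str)).toOption
}
```

For instance, safeTreatEscapes("a\\tb") should yield the string with a real tab between a and b, while an input containing \q should yield None.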

So if you want to support Unicode escapes and all the ordinary Scala escapes, you can apply the two steps in sequence:

def unescape(str: String): String =
  StringContext.treatEscapes(unescapeUnicode(str))

scala> unescape("\\u0061\\n\\u0062")
res4: String =
a
b

scala> unescape("\\u005ct")
res5: String = "    "