JavaScript Unicode: same letters but different uni

Posted 2020-04-09 16:11

I've got to send text to a print service that only accepts certain special characters, e.g. ï. My client somehow enters text in which the letters look the same but are made up of different underlying Unicode code points, so the print service does not process them correctly. Example:

Mine: ï (Unicode \u00EF)
Theirs: ï (Unicode \u0069\u0308; copy-pasting the two symbols into the Chrome bar, for example, will show that they actually look the same in textareas)

How can I convert all special characters from "their style" to "my style" (Dutch keyboard layout on Windows)? I guess this has something to do with the OS or keyboard layout, but I cannot find a list of the differences or anything else related to this issue. Does anyone have a suggestion for how to proceed?

2 answers
再贱就再见
#2 · 2020-04-09 16:21

As correctly pointed out in the comments, there are two ways to represent accented characters in Unicode, each corresponding to a "normalization form":

  • with a dedicated symbol (\u00EF == ï)
  • with a composition of the basic letter + accent (i.e. i + ¨ == i + \u0308 == ï)

ES6 adds a dedicated method that converts strings between normalization forms: String.prototype.normalize.

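// escape() is deprecated; it is used here only to make the code points visible

// convert the single-character ("composed") form to the multi-character ("decomposed") form:
escape("\u00EF".normalize("NFD"))
> "i%u0308"

// convert the decomposed form to the composed form:
escape("i\u0308".normalize("NFC"))
> "%EF"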

If your system doesn't support normalize yet, look around for shims.
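
Applied to the question, the idea could be sketched roughly as follows: normalize the client's text to the composed form (NFC) before it goes to the print service. The helper name toComposed is hypothetical. The fallback branch returns the input unchanged where normalize is missing, which is only a placeholder for wherever a shim would be wired in.

```javascript
// Hypothetical helper: bring text into the composed form (NFC), so that
// "i" + U+0308 becomes the single code point U+00EF.
function toComposed(text) {
  if (typeof String.prototype.normalize !== "function") {
    // Assumption: no native support and no shim loaded; return the input as-is.
    return text;
  }
  return text.normalize("NFC");
}

const theirs = "i\u0308"; // decomposed: letter i followed by combining diaeresis
const mine = "\u00EF";    // composed: single code point

console.log(theirs === mine);             // false
console.log(toComposed(theirs) === mine); // true
console.log(toComposed(theirs).length);   // 1
```

Text that is already composed passes through unchanged, so the helper can be applied to every string on its way to the print service.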

放我归山
#3 · 2020-04-09 16:34

\u00EF is ï, the Latin Small Letter I with Diaeresis (and \u0020 is the Space character)

\u0069\u0308 is the Latin Small Letter I followed by the Combining Diaeresis

Normalization is needed to transform the second, two-character sequence into the first. You will need to find some utility to perform this normalization before you send to your print service.
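
To see which of the two representations a given piece of text actually uses, something like the sketch below may help. The helper names codePoints and isComposed are hypothetical, and it assumes an environment with native String.prototype.normalize and padStart.

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX labels.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Hypothetical helper: true when the text is already in composed form (NFC).
function isComposed(text) {
  return text === text.normalize("NFC");
}

console.log(codePoints("\u00EF").join(" "));  // U+00EF
console.log(codePoints("i\u0308").join(" ")); // U+0069 U+0308
console.log(isComposed("\u00EF"));            // true
console.log(isComposed("i\u0308"));           // false
```

Running incoming text through isComposed is one way to confirm that the client's input is the decomposed variant before deciding where to normalize it.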

See JavaScript Unicode normalization for some options.
