Special character '\\u0098' read as '\

Posted 2019-07-12 12:39

Question:

I am creating test.js from Java, as shown below. test.js implements a function d() that receives the special character ˜ ('\u0098') as its parameter.

Function d() should display the charCodeAt() of this special character, which should be 152. However, it displays 732.

Please note that characters 152 and 732 are both represented by special character ˜, as per below.

http://www.fileformat.info/info/unicode/char/098/index.htm

http://www.fileformat.info/info/unicode/char/2dc/index.htm

How can I force function d() to display 152 instead of 732? (Is this a charset issue?) Thanks.

TEST.JAVA

public void doPost(HttpServletRequest req, HttpServletResponse res)
throws ServletException, IOException
{
    res.setHeader("Content-Type", "text/javascript;charset=ISO-8859-1");
    res.setHeader("Content-Disposition","attachment;filename=test.js");
    res.setCharacterEncoding("ISO-8859-1");
    PrintWriter printer=res.getWriter();
    printer.write("function d(a){a=(a+\"\").split(\"\");alert(a[0].charCodeAt(0));};d(\""); // Writes beginning of d() function
    printer.write('\u0098'); // Writes special character as parameter of d()
    printer.write("\");"); // Writes end of d() function
    printer.close();
}

TEST.JS created by TEST.JAVA

function d(a)
{
  a=(a+"").split("");
  alert(a[0].charCodeAt(0));
};
d("˜"); // Note special character representing '\u0098'

TEST.HTML

<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>
<body>
<script type="text/javascript" charset="ISO-8859-1" src="test.js"></script>
</body>
</html>

Answer 1:

"Please note that characters 152 and 732 are both represented by special character ˜, as per below."

Not really. ˜ is unequivocally character U+02DC (732), so charCodeAt is doing the right thing. Character U+0098 (152) is an invisible control code that is almost never used.

The trick is that "ISO-8859-1" has a different meaning to Java and web browsers. For Java it really is the ISO-8859-1 standard, which maps exactly to the first 256 code points of Unicode. That includes a range of little-used C1 control characters at 128–159.

However, for a web browser, "ISO-8859-1" actually means Windows code page 1252 (Western European), an encoding that puts assorted useful characters in the 128–159 block instead. This behaviour stems from early web browsers that simply used the machine's default code page. When proper Unicode and encoding support was added to browsers, compatibility concerns dictated continued support for the Windows characters despite their incorrect labelling as an ISO-8859 format.

So when you write a U+0098 character from Java in ISO-8859-1, you get an 0x98 byte, which then gets read in by the browser as U+02DC. This is normally harmless because no-one actually ever wants to use the C1 control codes in the range U+0080–U+009F. But it certainly is confusing.
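The two readings of the same byte can be sketched as follows. This is only an illustration, not what any browser literally runs: the names WINDOWS_1252_HIGH, decodeTrueLatin1 and decodeBrowserLatin1 are made up here, and the table assumes the windows-1252 mapping in which the five unassigned bytes fall through to the matching C1 control.

```javascript
// Code points that windows-1252 assigns to bytes 0x80..0x9F, in byte order.
// Assumption: the five bytes cp1252 leaves undefined (0x81, 0x8D, 0x8F,
// 0x90, 0x9D) are mapped to the C1 control with the same number.
var WINDOWS_1252_HIGH =
    '\u20ac\u0081\u201a\u0192\u201e\u2026\u2020\u2021' +
    '\u02c6\u2030\u0160\u2039\u0152\u008d\u017d\u008f' +
    '\u0090\u2018\u2019\u201c\u201d\u2022\u2013\u2014' +
    '\u02dc\u2122\u0161\u203a\u0153\u009d\u017e\u0178';

// Java's reading of "ISO-8859-1": the byte value is the code point.
function decodeTrueLatin1(byteValue) {
    return byteValue;
}

// A browser's reading of a resource labelled "ISO-8859-1":
// bytes 0x80..0x9F go through the windows-1252 table instead.
function decodeBrowserLatin1(byteValue) {
    if (byteValue >= 0x80 && byteValue <= 0x9f)
        return WINDOWS_1252_HIGH.charCodeAt(byteValue - 0x80);
    return byteValue;
}

console.log(decodeTrueLatin1(0x98));    // 152, what Java wrote
console.log(decodeBrowserLatin1(0x98)); // 732, what charCodeAt() reports
```

Outside the 0x80..0x9F block the two functions agree, which is why the mismatch usually goes unnoticed.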

This ancient quirk, along with the related one of treating &#...; character references in the range 128–159 as being cp1252 bytes, is finally documented and standardised as part of HTML5, but for the HTML parsing rules only. (Not XHTML5 as that follows the more sensible XML rules.) This is why the quoted fileformat.info page appears to say, misleadingly, that U+0098 is rendered like ˜.

If you really need to extract the cp1252 byte number of a character, you would have to use a look-up table to help you, because that information is not made visible to JavaScript. For example:

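// Characters that cp1252 assigns to bytes 0x80-0x9F, in byte order.
// The five bytes cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are
// padded with a repeat of the preceding character; indexOf() returns the
// first match, so the padding never changes the result.
var CP1252EXTRAS= '\u20ac\u20ac\u201a\u0192\u201e\u2026\u2020\u2021\u02c6\u2030\u0160\u2039\u0152\u0152\u017d\u017d\u017d\u2018\u2019\u201c\u201d\u2022\u2013\u2014\u02dc\u2122\u0161\u203a\u0153\u0153\u017e\u0178';

// Returns the cp1252 byte for the single character s, or -1 if it has none.
function getCodePage1252Byte(s) {
    var ix= CP1252EXTRAS.indexOf(s);
    if (ix!==-1)
        return 128+ix;
    var c= s.charCodeAt(0);
    // ASCII and U+00A0-U+00FF have the same number in cp1252.
    if (c<128 || (c>=160 && c<256))
        return c;
    return -1;
}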
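As a rough usage sketch, the function can be called on the character the browser actually produced to recover the byte that Java wrote. The table and function are repeated here only so that the snippet runs on its own; the sample inputs are arbitrary.

```javascript
var CP1252EXTRAS= '\u20ac\u20ac\u201a\u0192\u201e\u2026\u2020\u2021\u02c6\u2030\u0160\u2039\u0152\u0152\u017d\u017d\u017d\u2018\u2019\u201c\u201d\u2022\u2013\u2014\u02dc\u2122\u0161\u203a\u0153\u0153\u017e\u0178';

function getCodePage1252Byte(s) {
    var ix= CP1252EXTRAS.indexOf(s);
    if (ix!==-1)
        return 128+ix;
    var c= s.charCodeAt(0);
    if (c<128 || (c>=160 && c<256))
        return c;
    return -1;
}

console.log(getCodePage1252Byte('\u02dc')); // 152: the byte behind the small tilde
console.log(getCodePage1252Byte('\u20ac')); // 128: the euro sign
console.log(getCodePage1252Byte('A'));      // 65: ASCII is unchanged
console.log(getCodePage1252Byte('\u0098')); // -1: a real C1 control has no cp1252 byte here
```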

You probably don't want to do that. Anyhow, normally the answer is not to use ISO-8859-1, but to stick to good old UTF-8 (The Only Sensible Encoding™).

In any case, <script charset="..."> isn't supported by every browser, and Content-Type: text/javascript;charset=... is also not supported by every browser. There is no reliable way of serving JavaScript under a different encoding from the including page. If you are not 100% sure that every including page will use the same encoding as your script, the only safe way forward is to keep your JavaScript ASCII-safe, outputting JavaScript \unnnn sequences instead of literal bytes.

(An ASCII-compatible JSON encoder may help you do this.)
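One way such an encoder might look, as a minimal sketch: let JSON.stringify handle quotes, backslashes and low control characters, then replace every remaining character from U+007F upwards with a \unnnn escape. The name toAsciiSafeLiteral is hypothetical, and the same idea would have to be ported to Java for the servlet in the question.

```javascript
// Returns a double-quoted JavaScript string literal for s that contains
// only ASCII characters, so the file's charset label no longer matters.
function toAsciiSafeLiteral(s) {
    return JSON.stringify(s).replace(/[\u007f-\uffff]/g, function (ch) {
        // Pad the hex code to four digits, e.g. 0x98 becomes "0098".
        return '\\u' + ('0000' + ch.charCodeAt(0).toString(16)).slice(-4);
    });
}

var literal = toAsciiSafeLiteral('\u0098');
console.log(literal);                           // "\u0098"
console.log(JSON.parse(literal).charCodeAt(0)); // 152
```

Because the range starts at U+007F and runs to U+FFFF, it also escapes U+2028 and U+2029, which JSON allows unescaped but older JavaScript engines reject inside string literals.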



Answer 2:

Try:

    printer.write("\\u0098"); // a String, not a char literal: emits the six characters \u0098

JavaScript understands \uNNNN too, so you can explicitly form the string with the character code you want.
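With that change the generated test.js would end in d("\u0098"); and contain only ASCII bytes, so the charset label stops mattering. A quick way to check the effect, assuming a variant of d() that returns the code instead of calling alert() so it also runs outside a browser:

```javascript
// Same logic as d() in the question, but returning the value
// rather than showing it with alert().
function d(a) {
    a = (a + "").split("");
    return a[0].charCodeAt(0);
}

console.log(d("\u0098")); // 152: the escape names the C1 control directly
console.log(d("\u02dc")); // 732: the small tilde the browser decoded from byte 0x98
```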