I have an ASP Classic page with SHIFT_JIS charset. The meta tag under the page's head section is like this:
<meta http-equiv="Content-Type" content="text/html; charset=shift_jis">
My page has a text box (txtName) that should only allow 200 characters. I have a Javascript function that validates the character length, which is called on the onclick() event of my Submit button.
if(document.frmPage.txtName.value.length > 200) {
    alert("You have exceeded the maximum length of 200.");
    return false;
}
The problem is that Javascript is not returning the correct length for Japanese characters encoded in SHIFT_JIS. For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding that Javascript uses by default. Some characters, like ケ, take 2 or 3 characters in SHIFT_JIS.
If I rely only on the length provided by Javascript, long Japanese strings pass the page validation and the page then tries to save them to the database, which fails because the DB column has a maximum length of 200.
The browser that I'm using is Internet Explorer. Is there a way to get the SHIFT_JIS length of the Japanese character using Javascript? Is it possible to convert from Unicode to SHIFT_JIS using Javascript? How?
Thanks for the help!
For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding
Let's be clear: 测, U+6D4B (Han Character 'measure, estimate, conjecture') is a single character. When you encode it to a particular encoding like Shift-JIS, it may very well become multiple bytes.
In general JavaScript doesn't make encoding tables available so you can't find out how many bytes a character will take up. If you really need to, you have to carry around enough data to work it out yourself. For example, if you assume that the input contains only characters that are valid in Shift-JIS, this function would work out how many bytes are needed by keeping a list of all the characters that are a single byte, and assuming every other character takes two bytes:
function getShiftJISByteLength(s) {
    // Single-byte characters: \x00-\x80 plus the half-width katakana block U+FF61-U+FF9F.
    // Every other character is assumed to take two bytes.
    return s.replace(/[^\x00-\x80\uFF61-\uFF9F]/g, 'xx').length;
}
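As a rough sketch of how a byte count like this could replace the `.length` check from the question: the names `shiftJisByteLength` and `isWithinByteLimit` are made up for illustration, and the code assumes the input contains only characters that exist in Shift-JIS, with `\x00-\x80` and the half-width katakana block (U+FF61 to U+FF9F) counted as one byte each.

```javascript
// Hypothetical helper: estimate the Shift-JIS byte length of a string.
// Assumption: every character in s is representable in Shift-JIS.
function shiftJisByteLength(s) {
    // Replace every character that is not single-byte with two placeholders,
    // then count the resulting length.
    return s.replace(/[^\x00-\x80\uFF61-\uFF9F]/g, 'xx').length;
}

// Hypothetical validation helper: true when the value fits in maxBytes.
function isWithinByteLimit(value, maxBytes) {
    return shiftJisByteLength(value) <= maxBytes;
}
```

In the page's submit handler, the condition would then become something like `!isWithinByteLimit(document.frmPage.txtName.value, 200)` instead of comparing `.length` directly.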
However, there are no 8-byte sequences in Shift-JIS, and the character 测 is not available in Shift-JIS at all. (It's a Chinese character not used in Japan.)
Why you might be thinking it constitutes an 8-byte sequence is this: when a browser can't submit a character in a form, because it does not exist in the target charset, it replaces it with an HTML character reference: in this case &#27979;. This is a lossy mangling: you can't tell whether the user typed literally &#27979; or 测. And if you are displaying the submitted content &#27979; as 测 then that means you are forgetting to HTML-encode your output, which probably means your application is highly vulnerable to cross-site scripting.
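To make the "8 characters" concrete, here is a small sketch of what the browser's fallback produces and what escaping the output does to it. Both function names (`toCharacterReference`, `escapeHtml`) are hypothetical, and this is only an illustration in JavaScript, not the server-side code from the question. In Classic ASP the escaping would normally be done with `Server.HTMLEncode`.

```javascript
// Hypothetical helper: build the decimal HTML character reference
// for the first UTF-16 code unit of ch.
function toCharacterReference(ch) {
    return '&#' + ch.charCodeAt(0) + ';';
}

// Hypothetical helper: minimal HTML escaping of text before output.
function escapeHtml(text) {
    return String(text)
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;');
}

var reference = toCharacterReference('\u6D4B'); // U+6D4B is the character 测
// reference is "&#27979;", which is 8 characters long

var safeOutput = escapeHtml(reference);
// safeOutput is "&amp;#27979;", so a browser shows the literal text &#27979;
```

With escaping in place the submitted reference is displayed as typed text rather than being interpreted as markup.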
The only sensible answer is to use UTF-8 instead of Shift-JIS. UTF-8 can happily encode 测, or any other character, without having to resort to broken HTML character references. If you need to store content limited by encoded byte length in your database, there is a sneaky hack you can use to get the number of UTF-8 bytes in a string:
function getUTF8ByteLength(s) {
    return unescape(encodeURIComponent(s)).length;
}
although probably it would be better to store native Unicode strings in the database so that the length limit refers to actual characters and not bytes in some encoding.
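If the `unescape` trick feels too opaque, the same count can be worked out by hand from the UTF-16 code units. This is only a sketch: `utf8ByteLength` is a made-up name, and it counts an unpaired surrogate as 3 bytes, whereas `encodeURIComponent` throws a `URIError` for such input.

```javascript
// Hypothetical helper: count the bytes s would occupy when encoded as UTF-8.
function utf8ByteLength(s) {
    var bytes = 0;
    for (var i = 0; i < s.length; i++) {
        var code = s.charCodeAt(i);
        if (code < 0x80) {
            bytes += 1;                 // ASCII
        } else if (code < 0x800) {
            bytes += 2;                 // two-byte sequence
        } else if (code >= 0xD800 && code <= 0xDBFF && i + 1 < s.length) {
            var next = s.charCodeAt(i + 1);
            if (next >= 0xDC00 && next <= 0xDFFF) {
                bytes += 4;             // surrogate pair: one character outside the BMP
                i++;                    // skip the low surrogate
            } else {
                bytes += 3;             // unpaired high surrogate (assumption: count as 3)
            }
        } else {
            bytes += 3;                 // rest of the BMP, including 测
        }
    }
    return bytes;
}
```

For well-formed strings this should agree with `unescape(encodeURIComponent(s)).length`.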
You are getting confused between characters and bytes. 测 is ONE character, however you look at it. In UTF-16 (which is what Javascript uses), it's two BYTES. In Shift_JIS, 8 bytes, apparently. But in both cases, it's ONE character. So what you are trying to do is limit the text length to 200 BYTES. Since Javascript is using UTF-16 (UCS-2, really) you can get its byte length by multiplying the string length by 2, but that doesn't help you with Shift_JIS. Then again, you should probably consider switching to Unicode anyway, if you're working with Javascript...
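A tiny sketch of the "multiply by 2" point, with `utf16ByteLength` as a made-up name. Note that `.length` counts UTF-16 code units, so a character outside the Basic Multilingual Plane reports a length of 2.

```javascript
// Hypothetical helper: bytes needed to store s as UTF-16 (no byte order mark).
function utf16ByteLength(s) {
    return s.length * 2; // each UTF-16 code unit is two bytes
}

var han = '\u6D4B';          // 测: one code unit, so two bytes
var astral = '\uD83D\uDE00'; // one character stored as a surrogate pair, so four bytes
```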