In my JavaScript code I need to compose a message to the server in this format:
<size in bytes>CRLF
<data>CRLF
Example:
3
foo
The data may contain Unicode characters. I need to send them as UTF-8.
I'm looking for the most cross-browser way to calculate the length of the string in bytes in JavaScript.
I've tried this to compose my payload:
return unescape(encodeURIComponent(str)).length + "\r\n" + str + "\r\n"
But it does not give me accurate results in older browsers (or maybe the strings in those browsers are UTF-16?).
Any clues?
Update:
Example: the length of the string ЭЭХ! Naïve? in UTF-8 is 15 bytes (each of the three Cyrillic letters and the ï take two bytes; the remaining seven ASCII characters take one byte each), but some browsers report 23 bytes instead.
There is no way to do it in JavaScript natively.
If you know the character encoding, you can calculate it yourself though.
encodeURIComponent assumes UTF-8 as the character encoding, so if that is the encoding you need, you can do the counting yourself, as in the sketch below.
This works because of the way UTF-8 encodes multi-byte sequences. The first byte of a sequence either has a high bit of zero (a single-byte sequence) or has C, D, E, or F as its first hex digit. The second and subsequent bytes are the ones whose first two bits are 10; those are the extra bytes you want to count in UTF-8.
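A minimal sketch of that approach: count the %8x/%9x/%Ax/%Bx escapes produced by encodeURIComponent (the continuation bytes) and add them to the character count.

    function lengthInUtf8Bytes(str) {
        // Continuation bytes in UTF-8 start with the bits 10, so their
        // first hex digit is 8, 9, A or B. str.length supplies one byte
        // per character; the matched escapes are the extra bytes.
        var m = encodeURIComponent(str).match(/%[89AB]/gi);
        return str.length + (m ? m.length : 0);
    }

    lengthInUtf8Bytes("ЭЭХ! Naïve?"); // 15

One caveat: str.length counts a character outside the BMP as two UTF-16 code units, so this sketch reports one byte too many for each such character.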
The table in Wikipedia's UTF-8 article makes it clearer.
If instead you need to understand the page encoding, you can use this trick:
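One possible form of it (a sketch; the original trick may have differed) is to read the encoding the browser actually applied to the page:

    // Reports the character encoding of the current page.
    var pageEncoding = document.characterSet || document.charset || document.inputEncoding;
    alert("Page encoding: " + pageEncoding); // e.g. "UTF-8" or "windows-1252"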
Here is an independent and efficient method to count UTF-8 bytes of a string.
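A sketch consistent with that description, walking the string with charCodeAt, handling surrogate pairs, and throwing on lone surrogates:

    function byteLengthUtf8(str) {
        var bytes = 0;
        for (var i = 0; i < str.length; i++) {
            var code = str.charCodeAt(i);
            if (code <= 0x7F) {
                bytes += 1;   // ASCII
            } else if (code <= 0x7FF) {
                bytes += 2;   // two-byte sequence
            } else if (code >= 0xD800 && code <= 0xDBFF) {
                // High surrogate: must be followed by a low surrogate;
                // the pair encodes one four-byte UTF-8 sequence.
                var next = str.charCodeAt(i + 1);
                if (isNaN(next) || next < 0xDC00 || next > 0xDFFF) {
                    throw new Error("Malformed UCS-2: lone high surrogate at " + i);
                }
                bytes += 4;
                i++;          // skip the low surrogate
            } else if (code >= 0xDC00 && code <= 0xDFFF) {
                throw new Error("Malformed UCS-2: lone low surrogate at " + i);
            } else {
                bytes += 3;   // the rest of the BMP
            }
        }
        return bytes;
    }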
Note that the method may throw an error if an input string is malformed UCS-2.
This would work for BMP and SIP/SMP characters.
Actually, I figured out what's wrong. For the code to work, the page's <head> should have this tag:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Or, as suggested in the comments, if the server sends the charset in the HTTP Content-Type header, it should work as well. Then the results from different browsers are consistent.
Here is an example:
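A minimal page along these lines (reconstructed for illustration):

    <html>
      <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      </head>
      <body>
        <script>
          var str = "ЭЭХ! Naïve?";
          // With the charset declared above, browsers agree on 15 bytes.
          alert(unescape(encodeURIComponent(str)).length);
        </script>
      </body>
    </html>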
Note: I suspect that specifying any (accurate) encoding would fix the encoding problem. It is just a coincidence that I need UTF-8.
Here is a much faster version, which doesn't use regular expressions, nor encodeURIComponent:
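A sketch of it, reconstructed from the description below (not necessarily the author's exact code):

    function byteLength(str) {
        // Returns the UTF-8 byte length of a JavaScript (UTF-16) string.
        var s = str.length;
        for (var i = str.length - 1; i >= 0; i--) {
            var code = str.charCodeAt(i);
            if (code > 0x7F && code <= 0x7FF) s++;            // two-byte sequence
            else if (code > 0x7FF && code <= 0xFFFF) s += 2;  // three-byte sequence
            if (code >= 0xDC00 && code <= 0xDFFF) i--;        // trail surrogate: the pair counts as 4 bytes
        }
        return s;
    }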
Here is a performance comparison.
It just computes the UTF-8 length of each Unicode code point returned by charCodeAt (based on Wikipedia's descriptions of UTF-8 and of UTF-16 surrogate characters).
It follows RFC 3629 (where UTF-8 characters are at most 4 bytes long).
This function will return the byte size of any UTF-8 string you pass to it.
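A one-liner matching this description can be built on the Blob API, whose size is measured in UTF-8 bytes (a sketch, assuming Blob support):

    const byteSize = str => new Blob([str]).size;

    byteSize("ЭЭХ! Naïve?"); // 15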