How many bytes in a JavaScript string?

Posted 2019-01-04 08:53

I have a JavaScript string which is about 500K when sent from the server in UTF-8. How can I tell its size in JavaScript?

I know that JavaScript uses UCS-2. Does that mean 2 bytes per character? Or does it depend on the JavaScript implementation, the page encoding, or maybe the content type?

12 answers
女痞
#2 · 2019-01-04 09:20

If you're using Node.js, there is a simpler solution using buffers:

function getBinarySize(string) {
    return Buffer.byteLength(string, 'utf8');
}

There is an npm library for that: https://www.npmjs.org/package/utf8-binary-cutter (from yours faithfully)

成全新的幸福
#3 · 2019-01-04 09:24

You can use the Blob to get the string size in bytes.

Examples:

console.info(
  new Blob(['                                                                    
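The example above is cut off in this copy of the thread. As a rough sketch of the same idea, assuming an environment where `Blob` is available as a global (browsers, or a recent Node.js release), the `size` property reports how many bytes the string occupies once encoded as UTF-8. The helper name `blobByteSize` is hypothetical and not part of the original answer.

```javascript
// Hypothetical helper: byte size of a string after Blob encodes it as UTF-8.
function blobByteSize(str) {
  return new Blob([str]).size;
}

console.info(blobByteSize("hello"));     // 5: ASCII characters take 1 byte each
console.info(blobByteSize("\u2620"));    // 3: "☠" takes 3 bytes in UTF-8
console.info(blobByteSize("\u{1D306}")); // 4: a character outside the BMP takes 4 bytes
```

Note that this measures the UTF-8 size, not the size of the string in memory.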
叛逆
#4 · 2019-01-04 09:26

The size of a JavaScript string is

  • Pre-ES6: 2 bytes per character
  • ES6 and later: 2 bytes per character, or 5 or more bytes per character

Pre-ES6
Always 2 bytes per character. UTF-16 is not allowed because the spec says "values must be 16-bit unsigned integers". Since UTF-16 can encode a single character in 4 bytes (a surrogate pair), it would violate the 2-byte requirement. Crucially, while UTF-16 cannot be fully supported, the standard does require that the two-byte characters used are valid UTF-16 characters. In other words, pre-ES6 JavaScript strings support a subset of UTF-16 characters.

ES6 and later
2 bytes per character, or 5 or more bytes per character. The additional sizes come into play because ES6 (ECMAScript 6) adds support for Unicode code point escapes. A Unicode code point escape looks like this: \u{1D306}

Practical notes

  • This doesn't relate to the internal implementation of a particular engine. For example, some engines use data structures and libraries with full UTF-16 support, but what they provide externally doesn't have to be full UTF-16 support. An engine may also provide external UTF-16 support, but it is not mandated to do so.

  • For ES6, practically speaking, characters will never be more than 5 bytes long (2 bytes for the escape point + 3 bytes for the Unicode code point) because the latest version of Unicode only has 136,755 possible characters, which fits easily into 3 bytes. However, this is technically not limited by the standard, so in principle a single character could use, say, 4 bytes for the code point and 6 bytes total.

  • Most of the code examples here for calculating byte size don't seem to take into account ES6 Unicode code point escapes, so the results could be incorrect in some cases.
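As a quick check on the sizes discussed above, the following sketch inspects the string that a code point escape actually produces. It assumes an ES6-capable engine with the standard `TextEncoder` API available; the variable names are illustrative only. The escape appears to be source-code syntax: the resulting value is identical to the same character built with `String.fromCodePoint`, so the numbers are worth comparing against the estimates above.

```javascript
// The same character, written once as an escape and once built from its code point.
const escaped = "\u{1D306}";
const built = String.fromCodePoint(0x1D306);

const sameString = escaped === built;          // does the escape leave any trace in the value?
const codeUnits = escaped.length;              // number of 16-bit code units
const codePoints = [...escaped].length;        // string iteration walks by code point
const utf16Bytes = codeUnits * 2;              // size when stored as UTF-16
const utf8Bytes = new TextEncoder().encode(escaped).length; // size when encoded as UTF-8

console.log({ sameString, codeUnits, codePoints, utf16Bytes, utf8Bytes });
// { sameString: true, codeUnits: 2, codePoints: 1, utf16Bytes: 4, utf8Bytes: 4 }
```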

兄弟一词,经得起流年.
#5 · 2019-01-04 09:26

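You can try this. It counts each character outside the \x00-\xff range as 2 bytes and every other character as 1 byte, so it does not generally equal the UTF-8 size (for example, "☠" counts as 2 here but takes 3 bytes in UTF-8):

  function getByteLength(str) { // hypothetical name; the original snippet had no enclosing function
    var b = str.match(/[^\x00-\xff]/g); // characters outside \x00-\xff, or null if none
    return str.length + (b ? b.length : 0);
  }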

It worked for me.

趁早两清
#6 · 2019-01-04 09:27

Note that if you're targeting Node.js you can use Buffer.from(string).length, which encodes the string as UTF-8 by default:

var str = "\u2620"; // => "☠"
str.length; // => 1 (character)
Buffer.from(str).length // => 3 (bytes)
手持菜刀,她持情操
#7 · 2019-01-04 09:27

The answer from Lauri Oherd works well for most strings seen in the wild, but will fail if the string contains lone characters in the surrogate pair range, 0xD800 to 0xDFFF. E.g.

byteCount(String.fromCharCode(55555))
// URIError: URI malformed

This longer function should handle all strings:

function bytes (str) {
  var bytes=0, len=str.length, codePoint, next, i;

  for (i=0; i < len; i++) {
    codePoint = str.charCodeAt(i);

    // Lone surrogates cannot be passed to encodeURI
    if (codePoint >= 0xD800 && codePoint < 0xE000) {
      if (codePoint < 0xDC00 && i + 1 < len) {
        next = str.charCodeAt(i + 1);

        if (next >= 0xDC00 && next < 0xE000) {
          bytes += 4;
          i++;
          continue;
        }
      }
    }

    bytes += (codePoint < 0x80 ? 1 : (codePoint < 0x800 ? 2 : 3));
  }

  return bytes;
}

E.g.

bytes(String.fromCharCode(55555))
// 3

It will correctly calculate the size for strings containing surrogate pairs:

bytes(String.fromCharCode(55555, 57000))
// 4 (not 6)

The results can be compared with Node's built-in function Buffer.byteLength:

Buffer.byteLength(String.fromCharCode(55555), 'utf8')
// 3

Buffer.byteLength(String.fromCharCode(55555, 57000), 'utf8')
// 4 (not 6)
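Outside Node, a similar comparison can be sketched with the standard `TextEncoder` API, assuming it is available in the target environment. The helper name `utf8ByteLength` is hypothetical. `TextEncoder` replaces a lone surrogate with U+FFFD, which takes 3 bytes in UTF-8, so it should agree with the figures above.

```javascript
// Hypothetical helper built on the standard TextEncoder API (always encodes to UTF-8).
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

// A lone high surrogate is replaced by U+FFFD before encoding.
const loneSurrogate = utf8ByteLength(String.fromCharCode(55555));
// A valid surrogate pair encodes as a single 4-byte sequence.
const surrogatePair = utf8ByteLength(String.fromCharCode(55555, 57000));

console.log(loneSurrogate); // 3
console.log(surrogatePair); // 4 (not 6)
```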