How do I capture utf-8 decode errors in node.js?

I just discovered that Node (tested: v0.8.23, current git: v0.11.3-pre) ignores decoding errors in its Buffer handling: it silently replaces any byte sequence that is not valid UTF-8 with '\ufffd' (the Unicode REPLACEMENT CHARACTER) instead of throwing an exception about the invalid input. As a consequence, fs.readFile, process.stdin.setEncoding and friends mask a large class of bad-input errors for you.

Example which doesn't fail but really ought to:

> notValidUTF8 = new Buffer([ 128 ], 'binary')
<Buffer 80>
> decodedAsUTF8 = notValidUTF8.toString('utf8') // no exception thrown here!
'�'
> decodedAsUTF8 === '\ufffd'
true

'\ufffd' is a perfectly valid character that can occur in legal utf8 (as the sequence ef bf bd), so it is non-trivial to monkey-patch in error handling based on this showing up in the result.

Digging a little deeper, it looks like this stems from Node simply deferring to V8's strings, which have the above behaviour; V8 itself has no need to deal with an outside world full of foreign-encoded data.

Are there node modules or other approaches that let me catch utf-8 decode errors, preferably with context about where the error was discovered in the input string or buffer?

Tags: node.js utf-8 error-handling npm utf8-decode

2 Answers

狗以群分

#2 · 2019-01-26 07:19

As Josh C. said above: "npmjs.org/package/encoding"

From the npm website: "encoding is a simple wrapper around node-iconv and iconv-lite to convert strings from one encoding to another."

Download: $ npm install encoding

Example Usage

var encoding = require("encoding"); // after: npm install encoding
var result = encoding.convert(new Buffer([ 128 ], 'binary'), "utf8");
console.log(result); //<Buffer 80>

Visit the site: npm - encoding


男人必须洒脱

#3 · 2019-01-26 07:46

I hope you have solved the problem in the years since. I had a similar one and eventually solved it with this ugly trick:

  function isValidUTF8(buf){
   return Buffer.compare(new Buffer(buf.toString(),'utf8') , buf) === 0;
  }

which decodes the buffer to a string, encodes it back, and checks that the bytes stay the same.

The 'utf8' encoding can be omitted.

Then we have:

> isValidUTF8(new Buffer('this is valid, 指事字 eè we hope','utf8'))
true
> isValidUTF8(new Buffer([128]))
false
> isValidUTF8(new Buffer('\ufffd'))
true

where the '\ufffd' character is correctly considered as valid utf8.
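On Node versions that have Buffer.from (where the `new Buffer(...)` constructor is deprecated), the same round trip can be written as below, and the first byte at which the two buffers disagree gives a rough idea of where the input goes wrong. This is only a sketch: `firstMismatchOffset` is a name made up here for illustration, and the offset it returns is approximate, since it can land a byte or two after the start of the offending sequence.

```javascript
// Round-trip check: decode as UTF-8, re-encode, compare with the original bytes.
function isValidUTF8(buf) {
  return Buffer.from(buf.toString('utf8'), 'utf8').equals(buf);
}

// Hypothetical helper (not part of any library): returns -1 when the buffer
// survives the round trip, otherwise the index of the first byte that differs.
function firstMismatchOffset(buf) {
  const roundTripped = Buffer.from(buf.toString('utf8'), 'utf8');
  if (roundTripped.equals(buf)) return -1;
  const n = Math.min(buf.length, roundTripped.length);
  for (let i = 0; i < n; i++) {
    if (buf[i] !== roundTripped[i]) return i;
  }
  // Only the lengths differ, e.g. a multi-byte sequence cut off at the end.
  return n;
}

console.log(isValidUTF8(Buffer.from([128])));                            // false
console.log(firstMismatchOffset(Buffer.from([0x61, 0x62, 0x80, 0x63]))); // 2
```

The offset is exact when the bad byte is a stray continuation byte like 0x80; for a truncated multi-byte sequence it points at the end of the truncated part rather than its start.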

UPDATE: now this works in JXcore, too
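For what it's worth, much newer Node versions than the ones tested in the question expose the WHATWG TextDecoder through the util module, and its `fatal` option makes decoding throw a TypeError instead of inserting '\ufffd'. A minimal sketch, assuming a Node build with ICU support (`decodeStrictUTF8` is just an illustrative name):

```javascript
const { TextDecoder } = require('util');

// Strict decode: throws on invalid UTF-8 instead of substituting U+FFFD.
function decodeStrictUTF8(buf) {
  return new TextDecoder('utf-8', { fatal: true }).decode(buf);
}

try {
  decodeStrictUTF8(Buffer.from([128]));
} catch (err) {
  // Reached for the invalid input above.
  console.log('invalid UTF-8:', err.message);
}
```

Two caveats: as far as I can tell the error carries no position information, so it does not say where in the buffer the bad bytes are, and TextDecoder drops a leading byte order mark by default.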

