Question:

Suppose I have a byte stream (array), and I want to write code (in .NET C#) to validate whether it is a valid UTF-8 byte sequence. I want to write the code from scratch because I need to report the exact location of any invalid byte sequences, and possibly remove the invalid bytes, rather than just get a yes-or-no answer about whether the byte stream/array is valid.

Is there any sample code I can refer to? If not in C#, simple samples in C++ or Java would also be appreciated. Thanks!

By invalid UTF-8 byte sequences, I mean:

http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

Thanks in advance, George

Answer 1:

What you need is a DecoderFallback. When the Encoding class decodes a sequence of bytes, you can specify the fallback behaviour for invalid input:

Either report the error and stop processing (DecoderExceptionFallback),
or replace the invalid bytes and continue (DecoderReplacementFallback).

Using UTF8Encoding with a DecoderReplacementFallback, you can achieve just what you're looking for.

Answer 2:

This is what the original question asked for, even if it isn't quite what the original poster really needed. I've written some C code to validate a byte stream as UTF-8 and made it freely available. Maybe someone else directed to this question by a Google search will find it useful.

It takes one byte at a time, so is suitable for stream processing, and classifies everything into either valid UTF-8 or one of these possible errors in the byte sequence:

/* Ways a UTF stream can screw up */
/* a multibyte sequence without as many continuation bytes as expected.  e.g. [ef 81] 48 */
#define MISSING_CONTINUATION 1 
/* A continuation byte when not expected */
#define UNEXPECTED_CONTINUATION 2 
/* A full multibyte sequence encoding something that should have been encoded shorter */
#define OVERLONG_FORM 3
/* A full multibyte sequence encoding something larger than 10FFFF */
#define OUT_OF_RANGE 4
/* A full multibyte sequence encoding something in the range U+D800..U+DFFF */
#define BAD_SCALAR_VALUE 5
/* bytes 0xFE or 0xFF */
#define INVALID 6

This validator has the nice property that if a and b are valid UTF-8 byte streams, and x is some other stream of bytes, then the concatenation a + x + b will be decoded as all of the characters encoded in a, then some combination of characters and errors, then all of the characters encoded in b. That is, an invalid sequence of bytes can't eat validly encoded characters that start after the bad bytes.
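
To illustrate the idea (this is not the code from the answer, which isn't reproduced here), a byte-at-a-time classifier with that resynchronisation property could be sketched roughly as follows. The names utf8_state, utf8_events, utf8_feed, utf8_finish and utf8_scan are hypothetical, and the handling of the old 5- and 6-byte lead bytes (0xF8..0xFD) is an assumption chosen so that only 0xFE and 0xFF are reported as INVALID, as in the list above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Error codes as listed above; 0 is used here for "valid character". */
#define MISSING_CONTINUATION 1
#define UNEXPECTED_CONTINUATION 2
#define OVERLONG_FORM 3
#define OUT_OF_RANGE 4
#define BAD_SCALAR_VALUE 5
#define INVALID 6

/* Hypothetical decoder state; zero-initialise before the first byte. */
typedef struct {
    uint32_t cp;   /* code point bits accumulated so far */
    uint32_t min;  /* smallest code point that needs this sequence length */
    int need;      /* continuation bytes still expected */
} utf8_state;

/* Hypothetical event log: one entry per character or error, in order. */
typedef struct {
    int code[64];
    size_t n;
} utf8_events;

static void emit(utf8_events *ev, int code)
{
    if (ev->n < sizeof ev->code / sizeof ev->code[0])
        ev->code[ev->n++] = code;
}

static void start(utf8_state *s, uint32_t bits, uint32_t min, int need)
{
    s->cp = bits; s->min = min; s->need = need;
}

/* Feed one byte. A byte can produce up to two events: the error for a
   sequence it cut short, then whatever the byte itself starts or encodes. */
static void utf8_feed(utf8_state *s, unsigned char b, utf8_events *ev)
{
    assert(s != NULL && ev != NULL);
    if (s->need > 0) {
        if ((b & 0xC0) == 0x80) {
            s->cp = (s->cp << 6) | (b & 0x3Fu);
            if (--s->need > 0)
                return;                       /* sequence not finished yet */
            if (s->cp < s->min) emit(ev, OVERLONG_FORM);
            else if (s->cp > 0x10FFFF) emit(ev, OUT_OF_RANGE);
            else if (s->cp >= 0xD800 && s->cp <= 0xDFFF) emit(ev, BAD_SCALAR_VALUE);
            else emit(ev, 0);
            return;
        }
        /* Sequence cut short: report it, then treat b as a fresh start
           so the bad bytes cannot swallow a valid character. */
        s->need = 0;
        emit(ev, MISSING_CONTINUATION);
    }
    if (b < 0x80) emit(ev, 0);
    else if (b < 0xC0) emit(ev, UNEXPECTED_CONTINUATION);
    else if (b < 0xE0) start(s, b & 0x1Fu, 0x80, 1);
    else if (b < 0xF0) start(s, b & 0x0Fu, 0x800, 2);
    else if (b < 0xF8) start(s, b & 0x07u, 0x10000, 3);
    else if (b < 0xFC) start(s, b & 0x03u, 0x200000, 4);   /* assumption: decode, then OUT_OF_RANGE */
    else if (b < 0xFE) start(s, b & 0x01u, 0x4000000, 5);  /* assumption: decode, then OUT_OF_RANGE */
    else emit(ev, INVALID);
}

/* Call at end of input so a truncated final sequence is reported. */
static void utf8_finish(utf8_state *s, utf8_events *ev)
{
    if (s->need > 0) { s->need = 0; emit(ev, MISSING_CONTINUATION); }
}

/* Convenience wrapper: classify a whole buffer. */
static void utf8_scan(const unsigned char *buf, size_t len, utf8_events *ev)
{
    utf8_state st = {0, 0, 0};
    for (size_t i = 0; i < len; i++)
        utf8_feed(&st, buf[i], ev);
    utf8_finish(&st, ev);
}
```

With the example from the comment above, feeding the bytes ef 81 48 should log MISSING_CONTINUATION followed by one valid character (the 'H'), since the 0x48 is reprocessed as the start of a new sequence rather than discarded.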

Answer 3:

Good point. I didn't know that invalid UTF-8 sequences exist.

The Wikipedia article is a starting point, but I don't think you can have a complete test. Can you? I am interested.

A complete test means a function that answers yes or no for every possible byte sequence: a total function.

The question is what to do, or what to return, if your sequence is incomplete (a short sequence). As far as I know, some editors add a special character to fill it out. Maybe you should treat such cases as invalid sequences, and then your test will be complete. I wonder if this is the only such case.
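
If truncated sequences are treated as invalid, as suggested above, the check does become total: every byte array gets an answer. A minimal sketch in C, assuming the usual UTF-8 well-formedness rules (no overlong forms, no surrogates U+D800..U+DFFF, nothing above U+10FFFF); the function name utf8_first_invalid is hypothetical. It returns an offset rather than yes/no, so it also reports where the first problem is.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the offset at which the first ill-formed sequence starts,
   or len if the whole buffer is valid UTF-8. A sequence cut off by the
   end of the buffer counts as ill-formed. */
static size_t utf8_first_invalid(const unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t n;               /* total length of this sequence */
        unsigned long cp, min;  /* decoded value, smallest value allowed */

        if (b < 0x80) { i++; continue; }                       /* ASCII */
        else if (b >= 0xC2 && b <= 0xDF) { n = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0)     { n = 3; cp = b & 0x0F; min = 0x800; }
        else if (b >= 0xF0 && b <= 0xF4) { n = 4; cp = b & 0x07; min = 0x10000; }
        else return i;  /* stray continuation, C0/C1, or F5..FF */

        if (len - i < n)
            return i;   /* truncated at end of buffer */

        for (size_t k = 1; k < n; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return i;                 /* missing continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return i;   /* overlong, out of range, or surrogate */
        i += n;
    }
    return len;
}
```

A caller can compare the result with len to get the yes/no answer, or restart the scan one byte past the returned offset to collect every bad position.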

Anyway, I will mark this question as a favourite to keep track of the answers. I'm sure somebody will enlighten us.

Answer 4:

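using System.Text;

// Throws DecoderFallbackException if data is not valid UTF-8.
static void CheckUTF8(byte[] data)
{
    // Arguments: encoderShouldEmitUTF8Identifier = false, throwOnInvalidBytes = true
    new UTF8Encoding(false, true).GetCharCount(data);
}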

Throws a DecoderFallbackException on invalid data. DecoderFallbackException.Index should point to the index of the invalid sequence.

looking for samples to validate UTF-8
