Suppose I have a byte stream (array), and I want to write code (using .NET C#) to validate whether it is a valid UTF-8 byte sequence. I want to write the code from scratch because I need to report the exact location of any invalid byte sequences, and possibly remove the invalid bytes, rather than just get a yes-or-no answer about whether the byte stream/array is valid.
Is there any sample code I can refer to? If there is no C# code, simple samples in C++/Java would also be appreciated. Thanks!
By invalid UTF-8 byte sequences, I mean those described at
http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
thanks in advance,
George
What you need is DecoderFallback. When the Encoding class is converting a sequence of bytes to characters, you can specify the fallback behaviour for invalid input:
- Either report the error and stop processing (DecoderExceptionFallback).
- Or find the error and replace it (DecoderReplacementFallback).
Using UTF8Encoding and DecoderReplacementFallback you can achieve just what you're looking for.
This is what the original question asked for, even if it isn't quite what the original poster really needed. However, I've gone and written some C code to validate a byte stream as UTF-8, and made it freely available. Maybe someone else directed to this question via a Google search will find it useful.
It takes one byte at a time, so is suitable for stream processing, and classifies everything into either valid UTF-8 or one of these possible errors in the byte sequence:
/* Ways a UTF stream can screw up */
/* a multibyte sequence without as many continuation bytes as expected. e.g. [ef 81] 48 */
#define MISSING_CONTINUATION 1
/* A continuation byte when not expected */
#define UNEXPECTED_CONTINUATION 2
/* A full multibyte sequence encoding something that should have been encoded shorter */
#define OVERLONG_FORM 3
/* A full multibyte sequence encoding something larger than 10FFFF */
#define OUT_OF_RANGE 4
/* A full multibyte sequence encoding something in the range U+D800..U+DFFF */
#define BAD_SCALAR_VALUE 5
/* bytes 0xFE or 0xFF */
#define INVALID 6
This validator has the nice property that if a and b are valid UTF-8 byte streams, and x is some other stream of bytes, then the concatenation a + x + b will be decoded as all of the characters encoded in a, some combination of characters and errors, then all of the characters encoded in b. That is, an invalid sequence of bytes can't eat validly encoded characters that start after the bad bytes.
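The validator's own source is not reproduced in this answer, so what follows is only a minimal sketch in C of how the same error classes and the same "bad bytes can't eat good ones" property could be implemented. It scans a whole buffer rather than taking one byte at a time. The names UTF8_OK, utf8_step, utf8_next_error and utf8_strip_invalid are hypothetical, and treating lead bytes 0xF8..0xFD as 5- and 6-byte forms (so that only 0xFE and 0xFF are INVALID) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Error classes numbered as in the list above; UTF8_OK is added for this sketch. */
enum { UTF8_OK = 0, MISSING_CONTINUATION = 1, UNEXPECTED_CONTINUATION = 2,
       OVERLONG_FORM = 3, OUT_OF_RANGE = 4, BAD_SCALAR_VALUE = 5, INVALID = 6 };

/* Smallest code point that legitimately needs N continuation bytes. */
static const unsigned long min_cp[6] =
    { 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };

/* Examines one sequence starting at buf[0] (len must be >= 1).
   Stores the number of bytes it covers in *used and returns its class. */
static int utf8_step(const unsigned char *buf, size_t len, size_t *used)
{
    unsigned char b = buf[0];
    unsigned long cp;
    size_t need, i;

    *used = 1;
    if (b < 0x80) return UTF8_OK;                  /* plain ASCII */
    if (b < 0xC0) return UNEXPECTED_CONTINUATION;  /* 10xxxxxx with no lead byte */
    if (b >= 0xFE) return INVALID;

    if (b < 0xE0)      { need = 1; cp = b & 0x1F; }
    else if (b < 0xF0) { need = 2; cp = b & 0x0F; }
    else if (b < 0xF8) { need = 3; cp = b & 0x07; }
    else if (b < 0xFC) { need = 4; cp = b & 0x03; }
    else               { need = 5; cp = b & 0x01; }

    for (i = 1; i <= need; i++) {
        if (i >= len || (buf[i] & 0xC0) != 0x80) {
            *used = i;  /* stop before the offending byte so it is re-examined */
            return MISSING_CONTINUATION;
        }
        cp = (cp << 6) | (buf[i] & 0x3F);
    }
    *used = need + 1;
    if (cp > 0x10FFFF) return OUT_OF_RANGE;
    if (cp < min_cp[need]) return OVERLONG_FORM;
    if (cp >= 0xD800 && cp <= 0xDFFF) return BAD_SCALAR_VALUE;
    return UTF8_OK;
}

/* Finds the first error at or after offset `from`. On error, stores the
   offset and length of the bad bytes and returns the error class;
   returns UTF8_OK if the rest of the buffer is valid. */
int utf8_next_error(const unsigned char *buf, size_t len, size_t from,
                    size_t *err_pos, size_t *err_len)
{
    size_t pos = from, used;

    assert(buf != NULL || len == 0);
    while (pos < len) {
        int err = utf8_step(buf + pos, len - pos, &used);
        if (err != UTF8_OK) {
            *err_pos = pos;
            *err_len = used;
            return err;
        }
        pos += used;
    }
    return UTF8_OK;
}

/* Removes every invalid run in place and returns the new length. */
size_t utf8_strip_invalid(unsigned char *buf, size_t len)
{
    size_t in = 0, out = 0, used, i;

    while (in < len) {
        if (utf8_step(buf + in, len - in, &used) == UTF8_OK) {
            for (i = 0; i < used; i++)
                buf[out + i] = buf[in + i];
            out += used;
        }
        in += used;
    }
    return out;
}
```

Calling utf8_next_error again with from set to err_pos + err_len lists every bad run with its offset. A sequence cut short by the end of the buffer is reported as MISSING_CONTINUATION, which is one way to make the check total.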
Nice point. I didn't know that invalid UTF-8 sequences exist.
The Wikipedia article is a starting point, but I don't think that you can have a complete test. Can you? I am interested.
A complete test means having a function that answers yes or no for every possible sequence: a total function.
The question is what to do, or what to return, if your sequence is not complete (a short sequence). As far as I know, some editors add a special character in order to complete it. Maybe you should treat such cases as invalid sequences, and then your test will be complete.
I wonder if this is the only case.
Anyway, I will mark this question as a favourite in order to keep track of the answers. Surely somebody will enlighten us.
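using System.Text;

static void CheckUTF8(byte[] data)
{
    // Second argument (throwOnInvalidBytes) makes decoding throw instead of substituting U+FFFD.
    new UTF8Encoding(false, true).GetCharCount(data);
}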
Throws a DecoderFallbackException on invalid data. DecoderFallbackException.Index should point to the index of the invalid sequence.