Decoding partial UTF-8 into NSString

Posted 2019-05-17 17:38

Question:

While fetching a UTF-8-encoded file over the network with the NSURLConnection class, there's a good chance the delegate's connection:didReceiveData: message will be sent with an NSData that cuts the file in the middle of a character. UTF-8 is a multi-byte encoding scheme, so a single character can be split across two separate NSData objects.

In other words, if I join all the data I get from connection:didReceiveData: I will have a valid UTF-8 file, but each separate chunk is not necessarily valid UTF-8.

I do not want to store all the downloaded file in memory.

What I want is: given an NSData, decode whatever can be decoded into an NSString. If the last few bytes of the NSData are an incomplete multi-byte sequence, tell me, so I can save them for the next NSData.

One obvious solution is repeatedly trying to decode using initWithData:encoding:, each time truncating the last byte, until success. This, unfortunately, can be very wasteful.

Answer 1:

If you want to make sure that you don't stop in the middle of a UTF-8 multi-byte sequence, you're going to need to look at the end of the byte array and check the top 2 bits.

  1. If the top bit is 0, then it's a single-byte (ASCII) UTF-8 code, and you're done.
  2. If the top bit is 1 and the second-from-top is 0, then it is a continuation byte of a multi-byte sequence and might be the last byte of that sequence, so you will need to buffer it for later and then look at the preceding byte.
  3. If the top bit is 1 and the second-from-top is also 1, then it is the beginning of a multi-byte sequence, and you need to determine how many bytes are in the sequence by looking for the first 0 bit.
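
The three cases above can be sketched in plain C (which also compiles as Objective-C). This is only an illustration: the names Utf8Kind, utf8_kind and utf8_sequence_length are made up for this example, and the functions look only at bit patterns, so they do not reject overlong or out-of-range values.

```c
/* Kinds of byte, following the three cases in the list above. */
typedef enum { UTF8_ASCII, UTF8_CONTINUATION, UTF8_LEAD, UTF8_INVALID } Utf8Kind;

/* Classify one byte by its top bits (hypothetical helper). */
Utf8Kind utf8_kind(unsigned char b) {
    if ((b & 0x80) == 0x00) return UTF8_ASCII;        /* 0xxxxxxx */
    if ((b & 0xC0) == 0x80) return UTF8_CONTINUATION; /* 10xxxxxx */
    if (b >= 0xF8)          return UTF8_INVALID;      /* 11111xxx: not used by current UTF-8 */
    return UTF8_LEAD;                                 /* 110xxxxx, 1110xxxx or 11110xxx */
}

/* Total length in bytes of the sequence that a lead byte starts (2, 3 or 4),
   found from the position of the first 0 bit. Returns 0 for any other byte. */
int utf8_sequence_length(unsigned char b) {
    if ((b & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;
}
```

For the euro sign, encoded as 0xE2 0x82 0xAC, utf8_kind reports a lead byte followed by two continuation bytes, and utf8_sequence_length(0xE2) is 3.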

Look at the multi-byte table in the Wikipedia entry: http://en.wikipedia.org/wiki/UTF-8

// assumes that receivedData contains both the leftovers and the new data

const unsigned char *data = [receivedData bytes];
NSUInteger byteCount = [receivedData length];

if (byteCount<1)
    return nil;  // or @"";

unsigned char lastByte = data[byteCount-1];
// parentheses are required: == binds tighter than &
if ((lastByte & 0x80) == 0) {
    NSString *newString = [[NSString alloc] initWithBytes:data length:byteCount
                                                 encoding:NSUTF8StringEncoding];
    // verify success
    // remove bytes from mutable receivedData, or set overflow to empty
    return newString;
}

// now eat all of the continuation bytes
NSUInteger backCount = 0;
while ((byteCount > 0) && ((lastByte & 0xc0) == 0x80)) {
    backCount++;
    byteCount--;
    // do not read before the start of the buffer
    if (byteCount > 0) {
        lastByte = data[byteCount-1];
    }
}
// at this point, either we have exhausted byteCount or we have the initial character
// if we exhaust the byte count we're probably in an illegal sequence, as we should 
// always have the initial character in the receivedData

if (byteCount<1) {
    // error!
    return nil;
}

// at this point, you can either use just byteCount, or you can compute the 
// length of the sequence from the lastByte (now the lead byte) in order
// to determine if you have exactly the right number of bytes to decode UTF-8.

NSUInteger requiredBytes = 0;   // continuation bytes the lead byte calls for
if ((lastByte & 0xe0) == 0xc0) {  // 110xxxxx
    // 2 byte sequence
    requiredBytes = 1;
} else if ((lastByte & 0xf0) == 0xe0) {   // 1110xxxx
    // 3 byte sequence
    requiredBytes = 2;
} else if ((lastByte & 0xf8) == 0xf0) {   // 11110xxx
    // 4 byte sequence
    requiredBytes = 3;
} else {
    // not a valid lead byte: RFC 3629 dropped the old 5- and 6-byte forms
    // (111110xx, 1111110x), so anything else is an illegal UTF-8 sequence
    return nil;
}

 // now we know how many continuation bytes we need and we know how many
 //  (backCount) we have, so either use them, or take the 
 // introductory byte away.
 if (requiredBytes==backCount) {
     // we have the right number of bytes
     byteCount += backCount;
 } else { 
     // we don't have the right number of bytes, so remove the intro byte 
     byteCount -= 1;   
 }

 NSString *newString = [[NSString alloc] initWithBytes:data length:byteCount
                                              encoding:NSUTF8StringEncoding];
 // verify success
 // remove byteCount bytes from mutable receivedData, or set overflow to the 
 // bytes between byteCount and [receivedData length]
 return newString;


Answer 2:

UTF-8 is a pretty simple encoding to parse and was designed to make it easy to detect incomplete sequences and, if you start in the middle of an incomplete sequence, to find its beginning.

Search backward from the end for a byte that's either <= 0x7f or >= 0xc0. If it's <= 0x7f, it's complete. If it's between 0xc0 and 0xdf, inclusive, it requires one following byte to be complete. If it's between 0xe0 and 0xef, it requires two following bytes to be complete. If it's between 0xf0 and 0xf7 (of which only 0xf0 to 0xf4 occur in valid UTF-8), it requires three following bytes to be complete.
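
A minimal sketch of that backward search in plain C, assuming the buffer already starts on a character boundary (leftover bytes from the previous chunk were prepended). The function name utf8_complete_prefix_length is invented for this example; it only decides where to cut and leaves real validation to the decoder.

```c
#include <stddef.h>

/* Returns how many leading bytes of buf can be decoded right now, that is,
   len minus any incomplete multi-byte sequence at the very end.
   Only the last few bytes are inspected; earlier bytes are not validated. */
size_t utf8_complete_prefix_length(const unsigned char *buf, size_t len) {
    size_t i = len;
    size_t trailing = 0; /* continuation bytes seen while scanning backward */

    /* A lead byte is followed by at most 3 continuation bytes (10xxxxxx). */
    while (i > 0 && trailing < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        i--;
        trailing++;
    }
    if (i == 0) {
        return len; /* empty, or no lead byte at all: let the decoder complain */
    }

    unsigned char lead = buf[i - 1];
    size_t needed; /* continuation bytes this byte requires after it */
    if (lead <= 0x7F) {
        needed = 0;
    } else if (lead >= 0xC0 && lead <= 0xDF) {
        needed = 1;
    } else if (lead >= 0xE0 && lead <= 0xEF) {
        needed = 2;
    } else if (lead >= 0xF0 && lead <= 0xF7) {
        needed = 3;
    } else {
        return len; /* a fourth continuation byte or 0xF8..0xFF: malformed */
    }

    if (trailing < needed) {
        return i - 1; /* hold back the lead byte and its continuation bytes */
    }
    return len; /* the last sequence is complete (or malformed with extras) */
}
```

The caller would decode the first N bytes, where N is the return value, and keep the remaining len - N bytes (never more than 3) to prepend to the next chunk. The work per chunk is constant, unlike retrying the decode after dropping one byte at a time.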



Answer 3:

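I had a similar problem: a UTF-8 string was only partly decoded. The cause was sizing the C buffer with the NSString length (a count of UTF-16 code units) instead of the UTF-8 byte count.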

Before:

    NSString *adsTopic = [components[2] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
    adsInfo->adsTopic = malloc(sizeof(char) * adsTopic.length + 1);
    strncpy(adsInfo->adsTopic, [adsTopic UTF8String], adsTopic.length + 1);

After (solved):

    NSString *adsTopic = [components[2] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
    // size the buffer by UTF-8 byte count, not by adsTopic.length
    NSUInteger byteCount = [adsTopic lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
    NSLog(@"number of UTF-8 bytes in the string topic == %lu", (unsigned long)byteCount);

    adsInfo->adsTopic = malloc(byteCount + 1);
    strncpy(adsInfo->adsTopic, [adsTopic UTF8String], byteCount + 1);

    NSString *text = [NSString stringWithCString:adsInfo->adsTopic encoding:NSUTF8StringEncoding];
    NSLog(@"=== %@", text);
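
To see why the two counts differ, here is a small plain C sketch. The helper names utf8_code_point_count and utf8_duplicate are made up for this example, and counting code points is only an analogy: NSString's length counts UTF-16 code units, which matches the code point count only for characters in the Basic Multilingual Plane.

```c
#include <stdlib.h>
#include <string.h>

/* Number of code points in a NUL-terminated UTF-8 string: every byte that is
   not a continuation byte (10xxxxxx) starts a new code point. */
size_t utf8_code_point_count(const char *s) {
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}

/* Copy a UTF-8 string into a new buffer sized by its byte length.
   Returns NULL if allocation fails; the caller frees the result. */
char *utf8_duplicate(const char *s) {
    size_t byte_count = strlen(s);       /* bytes, not characters */
    char *copy = malloc(byte_count + 1); /* +1 for the terminating NUL */
    if (copy != NULL) {
        memcpy(copy, s, byte_count + 1);
    }
    return copy;
}
```

For "h\xC3\xA9llo" (the word "héllo"), strlen gives 6 bytes while utf8_code_point_count gives 5, so a buffer sized for 5 characters plus a terminator would cut off the final byte.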