Convert a multi-byte unicode byte array into an NS

Posted 2019-08-09 06:27

In Objective-C, is there a way to convert a multi-byte Unicode byte array into an NSString such that the conversion succeeds even if the array data is a partial buffer (one that does not end on a complete character boundary)?

The use case is receiving byte buffers from a stream: you want to parse the string form of the data received so far, but more data is still to come, and the buffer may end in the middle of a multi-byte Unicode character.

NSString's initWithData:encoding: method does not work for this purpose, as shown here...

Test code:

    - (void)test {
        char myArray[] = {'f', 'o', 'o', (char) 0xc3, (char) 0x97, 'b', 'a', 'r'};
        size_t sizeOfMyArray = sizeof(myArray);
        [self dump:myArray sizeOfMyArray:sizeOfMyArray];
        [self dump:myArray sizeOfMyArray:sizeOfMyArray - 1];
        [self dump:myArray sizeOfMyArray:sizeOfMyArray - 2];
        [self dump:myArray sizeOfMyArray:sizeOfMyArray - 3];
        [self dump:myArray sizeOfMyArray:sizeOfMyArray - 4];
        [self dump:myArray sizeOfMyArray:sizeOfMyArray - 5];
    }

    - (void)dump:(char[])myArray sizeOfMyArray:(size_t)sourceLength {
        NSString *string = [[NSString alloc] initWithData:[NSData dataWithBytes:myArray length:sourceLength] encoding:NSUTF8StringEncoding];
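        // %lu with explicit casts matches the unsigned size_t / NSUInteger arguments.
        NSLog(@"sourceLength: %lu bytes, string.length: %lu bytes, string :'%@'", (unsigned long)sourceLength, (unsigned long)string.length, string);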
    }

Output:

sourceLength: 8 bytes, string.length: 7 bytes, string :'foo×bar'
sourceLength: 7 bytes, string.length: 6 bytes, string :'foo×ba'
sourceLength: 6 bytes, string.length: 5 bytes, string :'foo×b'
sourceLength: 5 bytes, string.length: 4 bytes, string :'foo×'
sourceLength: 4 bytes, string.length: 0 bytes, string :'(null)'
sourceLength: 3 bytes, string.length: 3 bytes, string :'foo'

As can be seen, converting the "sourceLength: 4 bytes" byte array fails and returns (null). This is because only the first byte of the UTF-8 encoded '×' character (0xc3 0x97) is included.

Ideally there would be a function that returns the correct NSString and tells me how many bytes are "left over".

3 answers
甜甜的少女心
#2 · 2019-08-09 06:40

I ran into this problem before and then forgot about it for a while, so this was an opportunity to solve it. The code below is based on information from the UTF-8 page on Wikipedia. It is a category on NSData.

It checks the data from the end, and only the last four bytes, because the OP said that there can be gigabytes of data. Otherwise, with UTF-8 it is simpler to run through the bytes from the beginning.

/*
 Returns the range of valid UTF-8 encoded text by
 removing a partial trailing multi-byte character.
 It assumes that all the bytes are valid UTF-8 encoded characters,
 e.g. it doesn't raise a flag if a continuation byte is preceded
 by a single-byte character.
 */
- (NSRange)rangeOfUTF8WithoutPartialTrailingMultibytes
{
    NSRange validRange = {0, 0};

    NSUInteger trailLength = MIN([self length], 4U);
    unsigned char trail[4];
    [self getBytes:&trail
             range:NSMakeRange([self length] - trailLength, trailLength)];

    unsigned multibyteCount = 0;

    for (NSInteger i = trailLength - 1; i >= 0; i--) {
        if (isUTF8SingleByte(trail[i])) {
            validRange = NSMakeRange(0, [self length] - trailLength + i + 1);
            break;
        }

        if (isUTF8ContinuationByte(trail[i])) {
            multibyteCount++;
            continue;
        }

        if (isUTF8StartByte(trail[i])) {
            multibyteCount++;
            if (multibyteCount == lengthForUTF8StartByte(trail[i])) {
                validRange = NSMakeRange(0, [self length] - trailLength + i + multibyteCount);
            }
            else {
                validRange = NSMakeRange(0, [self length] - trailLength + i);
            } 
            break;
        }
    }
    return validRange;
}

Here are the static functions used in the method:

static BOOL isUTF8SingleByte(const unsigned char c)
{
    return c <= 0x7f;
}

static BOOL isUTF8ContinuationByte(const unsigned char c)
{
    return (c >= 0x80) && (c <= 0xbf);
}

static BOOL isUTF8StartByte(const unsigned char c)
{
    return (c >= 0xc2) && (c <= 0xf4);
}

static BOOL isUTF8InvalidByte(const unsigned char c)
{
    return (c == 0xc0) || (c == 0xc1) || (c > 0xf4);
}

static unsigned lengthForUTF8StartByte(const unsigned char c)
{
    if ((c >= 0xc2) && (c <= 0xdf)) {
        return 2;
    }
    else if ((c >= 0xe0) && (c <= 0xef)) {
        return 3;
    }
    else if ((c >= 0xf0) && (c <= 0xf4)) {
        return 4;
    }
    return 1;
}
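
For comparison, the same trailing-boundary check can be sketched in plain C (which also compiles inside an Objective-C file) against a raw buffer instead of NSData. This is only a sketch: the name `utf8_complete_prefix_length` is hypothetical, and like the category above it assumes that everything before the last four bytes is valid UTF-8. Unlike the category, it returns the full length when the tail is malformed rather than truncated, so that genuinely invalid data still fails in the decoder.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of Foundation): returns how many leading
   bytes of buf end on a UTF-8 character boundary. Only the last 4 bytes
   are inspected; earlier bytes are assumed to be valid UTF-8. */
size_t utf8_complete_prefix_length(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t seen = 0;                      /* bytes examined from the end */
    while (seen < len && seen < 4) {
        size_t pos = len - 1 - seen;
        unsigned char c = buf[pos];
        seen++;

        if ((c & 0xc0) == 0x80) {
            continue;                     /* continuation byte: keep looking for the lead byte */
        }

        size_t need;                      /* sequence length announced by the lead byte */
        if (c <= 0x7f)                    need = 1;
        else if (c >= 0xc2 && c <= 0xdf)  need = 2;
        else if (c >= 0xe0 && c <= 0xef)  need = 3;
        else if (c >= 0xf0 && c <= 0xf4)  need = 4;
        else return len;                  /* invalid lead byte: not a truncation */

        /* The sequence starting at pos has `seen` bytes available. */
        return (seen < need) ? pos : len;
    }
    return len;                           /* empty buffer, or no lead byte in the tail */
}
```

For the question's 4-byte buffer ('f', 'o', 'o', 0xc3) this returns 3, so the first 3 bytes can be wrapped in an NSData and decoded, and the remaining 1 byte kept until the next chunk arrives.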
Fickle 薄情
#3 · 2019-08-09 06:54

Here is my inefficient implementation, which I don't consider to be a correct answer. I'll leave it here in case others find it useful (and in the hope that someone else will give a better answer than this!)

It's in a category on NSMutableData...

    /**
    * Removes the biggest string possible from this NSMutableData, leaving any remainder unicode half-characters behind.
    *
    * NOTE: This is a very inefficient implementation, it may require multiple parsing of the complete NSMutableData buffer,
    * it is especially inefficient when the data buffer does not contain a valid string encoding, as all lengths will be
    * attempted.
    */
    - (NSString *)removeMaximumStringUsingEncoding:(NSStringEncoding)encoding {
        if (self.length > 0) {
            // Quick test for the case where the whole buffer can be used (the common case; requires no NSData manipulation).
            NSString *result = [[NSString alloc] initWithData:self encoding:encoding];
            if (result != nil) {
                self.length = 0; // Simple case, we used the whole buffer.
                return result;
            }

            // Try to find the largest subData that is a valid string.
            for (NSUInteger subDataLength = self.length - 1; subDataLength > 0; subDataLength--) {
                NSRange subDataRange = NSMakeRange(0, subDataLength);
                result = [[NSString alloc] initWithData:[self subdataWithRange:subDataRange] encoding:encoding];
                if (result != nil) {
                    // Delete the bytes we used from our buffer, leave the remainder.
                    [self replaceBytesInRange:subDataRange withBytes:NULL length:0];
                    return result;
                }
            }
        }
        return @"";
    }
The star"
#4 · 2019-08-09 07:00

You largely have your own answer. If the initWithData:encoding: method returns nil, then you know the buffer has a partial (invalid) character at the end.

Modify dump to return an int. Then have it attempt to create the NSString in a loop. Each time you get nil, reduce the length and try again. Once you get a valid NSString, return the difference between the used length and the passed length.
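
The retry loop described above can be sketched in plain C (which also compiles inside an Objective-C file). This is a sketch under stated assumptions, not a definitive implementation: `utf8_is_valid` is a simplified stand-in for "initWithData:encoding: returned non-nil", and `utf8_leftover_by_retry` is a hypothetical name. Because a UTF-8 character is at most 4 bytes long, at most 3 trailing bytes can belong to an incomplete character, so the sketch gives up after trimming 3 bytes instead of retrying down to zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a successful decode. It checks the lead and
   continuation byte structure only, not overlong or surrogate encodings. */
static bool utf8_is_valid(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t need;
        if (c <= 0x7f)                    need = 1;
        else if (c >= 0xc2 && c <= 0xdf)  need = 2;
        else if (c >= 0xe0 && c <= 0xef)  need = 3;
        else if (c >= 0xf0 && c <= 0xf4)  need = 4;
        else return false;                /* stray continuation or invalid lead byte */

        if (need > len - i) return false; /* sequence cut off by the end of the buffer */
        for (size_t k = 1; k < need; k++) {
            if ((buf[i + k] & 0xc0) != 0x80) return false;
        }
        i += need;
    }
    return true;
}

/* Hypothetical helper: returns how many trailing bytes (0..3) must be held
   back so that the rest decodes, or -1 if trimming up to 3 bytes does not help. */
int utf8_leftover_by_retry(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t trimmed = 0; trimmed <= 3 && trimmed <= len; trimmed++) {
        if (utf8_is_valid(buf, len - trimmed)) {
            return (int)trimmed;
        }
    }
    return -1;
}
```

A return value of -1 signals that the data is invalid for reasons other than a truncated final character, which a caller would probably want to treat as an error rather than wait for more bytes.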
