In Objective-C, is there a way to convert a multi-byte Unicode byte array into an NSString so that the conversion succeeds even if the array holds a partial buffer (one that does not end on a complete character boundary)?
The use case is receiving byte buffers from a stream and wanting to parse the string form of the current buffer while more data is still to come, so the buffer may end in the middle of a multi-byte Unicode character.
NSString's initWithData:encoding: method does not work for this purpose, as shown here...
Test code:
- (void)test {
    // "foo×bar" in UTF-8; '×' (U+00D7) is the two-byte sequence 0xC3 0x97.
    char myArray[] = {'f', 'o', 'o', (char) 0xc3, (char) 0x97, 'b', 'a', 'r'};
    size_t sizeOfMyArray = sizeof(myArray);
    // Decode progressively shorter prefixes of the buffer.
    [self dump:myArray sizeOfMyArray:sizeOfMyArray];
    [self dump:myArray sizeOfMyArray:sizeOfMyArray - 1];
    [self dump:myArray sizeOfMyArray:sizeOfMyArray - 2];
    [self dump:myArray sizeOfMyArray:sizeOfMyArray - 3];
    [self dump:myArray sizeOfMyArray:sizeOfMyArray - 4];
    [self dump:myArray sizeOfMyArray:sizeOfMyArray - 5];
}

- (void)dump:(char[])myArray sizeOfMyArray:(size_t)sourceLength {
    NSData *data = [NSData dataWithBytes:myArray length:sourceLength];
    NSString *string = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
    // string.length is an NSUInteger counting UTF-16 code units; cast to match %lu.
    NSLog(@"sourceLength: %lu bytes, string.length: %lu bytes, string :'%@'",
          (unsigned long)sourceLength, (unsigned long)string.length, string);
}
Output:
sourceLength: 8 bytes, string.length: 7 bytes, string :'foo×bar'
sourceLength: 7 bytes, string.length: 6 bytes, string :'foo×ba'
sourceLength: 6 bytes, string.length: 5 bytes, string :'foo×b'
sourceLength: 5 bytes, string.length: 4 bytes, string :'foo×'
sourceLength: 4 bytes, string.length: 0 bytes, string :'(null)'
sourceLength: 3 bytes, string.length: 3 bytes, string :'foo'
As can be seen, converting the "sourceLength: 4 bytes" byte array fails and returns (null). This is because the UTF-8 encoding of the '×' character (0xc3 0x97) is only partially included.
Ideally there would be a function that returns the correct NSString and tells me how many bytes are "left over".
I had this problem before and forgot about it for a while, so this was an opportunity to solve it. The code below is based on information from the UTF-8 page on Wikipedia. It is a category on NSData.
It checks the data from the end, and only the last four bytes, because the OP said the data can be gigabytes in size. Otherwise, with UTF-8 it is simpler to run through the bytes from the beginning.
Here are the static functions used in the method:
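The category code itself is not shown here. As a rough illustration of the end-of-buffer check described above, here is a minimal sketch in plain C, which also compiles inside an Objective-C file. It is not the answerer's code: the function name utf8_complete_prefix_length is hypothetical, and the sketch assumes the buffer is otherwise valid UTF-8, since it only looks at the last four bytes and does no further validation.

```c
#include <stddef.h>

/* Hypothetical helper (not from the original answer): returns how many
 * leading bytes of buf end on a complete UTF-8 character boundary.
 * Only the tail of the buffer (at most 4 bytes) is inspected. */
static size_t utf8_complete_prefix_length(const unsigned char *buf, size_t len)
{
    size_t back = 0; /* continuation bytes found at the very end */

    /* Step back over trailing continuation bytes (10xxxxxx), at most 3. */
    while (back < len && back < 3 && (buf[len - 1 - back] & 0xC0) == 0x80) {
        back++;
    }
    if (back == len) {
        /* Empty buffer, or nothing but continuation bytes: no lead byte
         * to examine, so leave the decision to the real decoder. */
        return len;
    }

    unsigned char lead = buf[len - 1 - back];
    size_t expected; /* total sequence length announced by the lead byte */
    if ((lead & 0x80) == 0x00) {
        expected = 1;               /* 0xxxxxxx: ASCII */
    } else if ((lead & 0xE0) == 0xC0) {
        expected = 2;               /* 110xxxxx */
    } else if ((lead & 0xF0) == 0xE0) {
        expected = 3;               /* 1110xxxx */
    } else if ((lead & 0xF8) == 0xF0) {
        expected = 4;               /* 11110xxx */
    } else {
        return len;                 /* not a lead byte: invalid data, not truncation */
    }

    size_t have = back + 1;         /* bytes of the last sequence actually present */
    return (have < expected) ? len - have : len;
}
```

For the question's buffer truncated to 4 bytes this returns 3, so one byte is left over for the next read. From Objective-C, one could pass data.bytes and data.length, decode only the returned prefix with initWithBytes:length:encoding:, and keep the remaining bytes to prepend to the next chunk.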
Here is my inefficient implementation, which I don't consider to be a correct answer. I'll leave it here in case others find it useful (and in the hope that someone else will give a better answer than this!)
It's in a category on NSMutableData ...
You largely have your own answer. If the initWithData:encoding: method returns nil, then you know the buffer has a partial (invalid) character at the end.
Modify dump to return an int. Then have it attempt to create the NSString in a loop. Each time you get nil, reduce the length by one byte and try again. Once you get a valid NSString, return the difference between the used length and the passed length.
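The retry loop described above can be sketched as follows, again in plain C so that it stands alone. Everything named here is an assumption for illustration: decode_probe stands in for "does initWithData:encoding: return non-nil?", utf8_probe is a simplified structural check that plays that role in this sketch (it does not reject overlong forms or surrogates), and leftover_after_retry is a hypothetical name. The cap of three dropped bytes is an addition to the answer's description; it relies on a UTF-8 sequence being at most four bytes long, so a longer run of failures means the data is invalid rather than truncated.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the real decoder: returns true if the bytes decode. */
typedef bool (*decode_probe)(const unsigned char *bytes, size_t len);

/* Simplified structural UTF-8 check used as the probe in this sketch. */
static bool utf8_probe(const unsigned char *bytes, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = bytes[i];
        size_t n; /* sequence length announced by the lead byte */
        if (b < 0x80) n = 1;
        else if ((b & 0xE0) == 0xC0) n = 2;
        else if ((b & 0xF0) == 0xE0) n = 3;
        else if ((b & 0xF8) == 0xF0) n = 4;
        else return false;              /* stray continuation or invalid byte */
        if (n > len - i) return false;  /* sequence cut off by end of buffer */
        for (size_t k = 1; k < n; k++) {
            if ((bytes[i + k] & 0xC0) != 0x80) return false;
        }
        i += n;
    }
    return true;
}

/* Returns how many trailing bytes had to be dropped before the probe
 * accepted the buffer, or -1 if dropping up to 3 bytes did not help. */
static int leftover_after_retry(const unsigned char *bytes, size_t len,
                                decode_probe probe)
{
    for (size_t dropped = 0; dropped <= 3 && dropped <= len; dropped++) {
        if (probe(bytes, len - dropped)) {
            return (int)dropped;
        }
    }
    return -1;
}
```

Each failed attempt re-decodes the whole buffer, so on large buffers this is considerably more expensive than inspecting only the last few bytes.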