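In Objective-C, is there a way to convert a multi-byte Unicode byte array into an NSString that succeeds even when the array is a partial buffer (one that does not end on a complete character boundary)?
The use case is receiving a buffer of bytes from a stream: you want to parse a string version of the data received so far, but more data is still to come and the buffer may end in the middle of a multi-byte Unicode character.
NSString's initWithData:encoding: method does not work for this purpose, as shown below.
Test code: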
- (void)test {
char myArray[] = {'f', 'o', 'o', (char) 0xc3, (char) 0x97, 'b', 'a', 'r'};
size_t sizeOfMyArray = sizeof(myArray);
[self dump:myArray sizeOfMyArray:sizeOfMyArray];
[self dump:myArray sizeOfMyArray:sizeOfMyArray - 1];
[self dump:myArray sizeOfMyArray:sizeOfMyArray - 2];
[self dump:myArray sizeOfMyArray:sizeOfMyArray - 3];
[self dump:myArray sizeOfMyArray:sizeOfMyArray - 4];
[self dump:myArray sizeOfMyArray:sizeOfMyArray - 5];
}
- (void)dump:(char[])myArray sizeOfMyArray:(size_t)sourceLength {
NSString *string = [[NSString alloc] initWithData:[NSData dataWithBytes:myArray length:sourceLength] encoding:NSUTF8StringEncoding];
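// string.length counts UTF-16 code units, not bytes; the "bytes" label is kept so the log matches the output shown.
NSLog(@"sourceLength: %zu bytes, string.length: %lu bytes, string :'%@'", sourceLength, (unsigned long)string.length, string);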
}
Output:
sourceLength: 8 bytes, string.length: 7 bytes, string :'foo×bar'
sourceLength: 7 bytes, string.length: 6 bytes, string :'foo×ba'
sourceLength: 6 bytes, string.length: 5 bytes, string :'foo×b'
sourceLength: 5 bytes, string.length: 4 bytes, string :'foo×'
sourceLength: 4 bytes, string.length: 0 bytes, string :'(null)'
sourceLength: 3 bytes, string.length: 3 bytes, string :'foo'
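As you can see, converting the byte array with "sourceLength: 4 bytes" fails and returns (null). This is because the UTF-8 encoding of the Unicode character '×' (0xc3 0x97) is only partially included.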
Ideally there would be a function I could use that returns the correct NSString and tells me how many bytes are "left over".
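To make the failing case concrete, here is a minimal sketch in plain C (which also compiles inside an Objective-C file) of how the two bytes 0xc3 0x97 combine into a single code point. The helper name decode_two_byte_utf8 is hypothetical and is not part of Foundation; it only handles the two-byte form and does no validation.

```c
#include <stdint.h>

/* Hypothetical helper: decodes a two-byte UTF-8 sequence (lead byte in
   0xc2..0xdf followed by one continuation byte) into its code point.
   No validation is performed; this is only an illustration. */
static uint32_t decode_two_byte_utf8(unsigned char lead, unsigned char continuation)
{
    /* The lead byte 110xxxxx carries the top 5 bits,
       the continuation byte 10xxxxxx carries the low 6 bits. */
    return ((uint32_t)(lead & 0x1f) << 6) | (uint32_t)(continuation & 0x3f);
}
```

Calling decode_two_byte_utf8(0xc3, 0x97) gives 0xd7, which is U+00D7, the '×' character. With only 0xc3 in the buffer the low six bits are missing, so no decoder can produce the character yet.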
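You mostly have the answer yourself. If the initWithData:encoding: method returns nil, then you know the buffer has a partial (invalid) character at the end.
Modify dump to return an int, then have it try to create the NSString in a loop. Each time you get nil, shorten the length and try again. Once you get a valid NSString, return the difference between the length used and the length passed in.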
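The retry loop described above can be sketched in plain C so that it can be tried without Foundation. Everything here is an assumption made for illustration: utf8_is_valid is a simplified structural check standing in for "initWithData:encoding: returned a non-nil string", utf8_leftover_bytes is a hypothetical name, and the cap of three retries is an addition that relies on a UTF-8 character being at most four bytes long.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a successful NSString conversion: checks that
   every lead byte is followed by the right number of continuation bytes.
   It does not reject every overlong or surrogate encoding. */
static bool utf8_is_valid(const unsigned char *bytes, size_t length)
{
    size_t i = 0;
    while (i < length) {
        unsigned char c = bytes[i];
        size_t expected;
        if (c <= 0x7f)                   expected = 1;
        else if (c >= 0xc2 && c <= 0xdf) expected = 2;
        else if (c >= 0xe0 && c <= 0xef) expected = 3;
        else if (c >= 0xf0 && c <= 0xf4) expected = 4;
        else return false;               /* stray continuation or invalid lead byte */
        if (expected > length - i) return false;   /* character is cut off */
        for (size_t k = 1; k < expected; k++) {
            if ((bytes[i + k] & 0xc0) != 0x80) return false;
        }
        i += expected;
    }
    return true;
}

/* Shortens the buffer one byte at a time until it validates and returns
   the number of bytes left over, or SIZE_MAX if dropping up to three
   bytes does not help (the data is then not truncated UTF-8). */
static size_t utf8_leftover_bytes(const unsigned char *bytes, size_t length)
{
    for (size_t dropped = 0; dropped <= 3 && dropped <= length; dropped++) {
        if (utf8_is_valid(bytes, length - dropped)) return dropped;
    }
    return SIZE_MAX;
}
```

For the question's array, a length of 4 ("foo" plus 0xc3) gives 1 left-over byte, while lengths 3 and 5 give 0. Note that every retry validates the whole prefix again, which is the inefficiency the later answers try to avoid.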
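I had this problem before and then forgot about it for a while; this was an opportunity to solve it. The code below is based on information from the Wikipedia UTF-8 page. It is a category on NSData.
It examines only the last four bytes of the data, because the OP said the data can be gigabytes in size. Otherwise, with UTF-8 it is simpler to run through the bytes from the beginning.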
/*
Return the range of valid UTF-8 encoded text by
removing a partial trailing multi-byte character.
It assumes that all the bytes are valid UTF-8,
e.g. it doesn't raise a flag if a continuation byte is preceded
by a single-byte character.
*/
- (NSRange)rangeOfUTF8WithoutPartialTrailingMultibytes
{
NSRange validRange = {0, 0};
NSUInteger trailLength = MIN([self length], 4U);
unsigned char trail[4];
[self getBytes:trail
range:NSMakeRange([self length] - trailLength, trailLength)];
unsigned multibyteCount = 0;
for (NSInteger i = trailLength - 1; i >= 0; i--) {
if (isUTF8SingleByte(trail[i])) {
validRange = NSMakeRange(0, [self length] - trailLength + i + 1);
break;
}
if (isUTF8ContinuationByte(trail[i])) {
multibyteCount++;
continue;
}
if (isUTF8StartByte(trail[i])) {
multibyteCount++;
if (multibyteCount == lengthForUTF8StartByte(trail[i])) {
validRange = NSMakeRange(0, [self length] - trailLength + i + multibyteCount);
}
else {
validRange = NSMakeRange(0, [self length] - trailLength + i);
}
break;
}
}
return validRange;
}
Here are the static functions used in that method (they need to be declared before the method that calls them):
static BOOL isUTF8SingleByte(const unsigned char c)
{
return c <= 0x7f;
}
static BOOL isUTF8ContinuationByte(const unsigned char c)
{
return (c >= 0x80) && (c <= 0xbf);
}
static BOOL isUTF8StartByte(const unsigned char c)
{
return (c >= 0xc2) && (c <= 0xf4);
}
static BOOL isUTF8InvalidByte(const unsigned char c)
{
return (c == 0xc0) || (c == 0xc1) || (c > 0xf4);
}
static unsigned lengthForUTF8StartByte(const unsigned char c)
{
if ((c >= 0xc2) && (c <= 0xdf)) {
return 2;
}
else if ((c >= 0xe0) && (c <= 0xef)) {
return 3;
}
else if ((c >= 0xf0) && (c <= 0xf4)) {
return 4;
}
return 1;
}
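For quick experiments outside Foundation, the same backward scan over the last four bytes can be written against a raw byte buffer. This is only a sketch that mirrors the category above under the same assumption (the bytes are otherwise valid UTF-8); the names utf8_sequence_length and utf8_complete_prefix_length are hypothetical.

```c
#include <stddef.h>

/* Number of bytes announced by a UTF-8 lead byte, or 0 if c is a
   continuation byte or a byte that never appears in UTF-8. */
static size_t utf8_sequence_length(unsigned char c)
{
    if (c <= 0x7f) return 1;
    if (c >= 0xc2 && c <= 0xdf) return 2;
    if (c >= 0xe0 && c <= 0xef) return 3;
    if (c >= 0xf0 && c <= 0xf4) return 4;
    return 0;
}

/* Returns the length of the longest prefix that does not end in a
   partial multi-byte character. Only the last four bytes are examined. */
static size_t utf8_complete_prefix_length(const unsigned char *bytes, size_t length)
{
    size_t window = length < 4 ? length : 4;
    for (size_t back = 1; back <= window; back++) {
        size_t index = length - back;
        size_t announced = utf8_sequence_length(bytes[index]);
        if (announced == 0) continue;         /* keep scanning backwards */
        if (announced == 1) return index + 1; /* ASCII: keep it, drop anything after it */
        /* Multi-byte lead: keep the character only if all its bytes arrived. */
        return (announced == back) ? length : index;
    }
    /* No lead byte in the window: like the category, report an empty range. */
    return 0;
}
```

With the question's array, lengths 8 and 5 are returned unchanged, while length 4 yields 3, so one byte stays in the buffer until the rest of the '×' character arrives.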
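Here is my inefficient implementation, which I don't consider to be a proper answer. I'll leave it here in case someone else finds it useful (in the hope that someone will give a better answer than this!).
It is a category on NSMutableData: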
/**
* Removes the biggest string possible from this NSMutableData, leaving any remainder unicode half-characters behind.
*
* NOTE: This is a very inefficient implementation, it may require multiple parsing of the complete NSMutableData buffer,
* it is especially inefficient when the data buffer does not contain a valid string encoding, as all lengths will be
* attempted.
*/
- (NSString *)removeMaximumStringUsingEncoding:(NSStringEncoding)encoding {
if (self.length > 0) {
// Quick test for the case where the whole buffer can be used (is common case, and doesn't require NSData manipulation).
NSString *result = [[NSString alloc] initWithData:self encoding:encoding];
if (result != nil) {
self.length = 0; // Simple case, we used the whole buffer.
return result;
}
// Try to find the largest subData that is a valid string.
for (NSUInteger subDataLength = self.length - 1; subDataLength > 0; subDataLength--) {
NSRange subDataRange = NSMakeRange(0, subDataLength);
result = [[NSString alloc] initWithData:[self subdataWithRange:subDataRange] encoding:encoding];
if (result != nil) {
// Delete the bytes we used from our buffer, leave the remainder.
[self replaceBytesInRange:subDataRange withBytes:NULL length:0];
return result;
}
}
}
return @"";
}
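The "delete the bytes we used, leave the remainder" step in the method above has a simple equivalent on a plain C buffer. The following is a sketch only; discard_consumed_prefix is a hypothetical name, and it assumes the caller already knows how many bytes were converted.

```c
#include <stddef.h>
#include <string.h>

/* Removes the first `consumed` bytes from the buffer by sliding the
   remaining bytes to the front. Returns the new number of bytes held. */
static size_t discard_consumed_prefix(unsigned char *buffer, size_t length, size_t consumed)
{
    if (consumed > length) consumed = length;
    /* memmove is used because source and destination overlap. */
    memmove(buffer, buffer + consumed, length - consumed);
    return length - consumed;
}
```

After converting "foo" from a buffer holding "foo" plus 0xc3, calling this with consumed set to 3 leaves the single byte 0xc3 at the front, ready for the next chunk from the stream to be appended.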
Source: Convert a multi-byte unicode byte array into an NSString, using a partial buffer