How can I read a large UTF-8 file on an iPhone?

Posted 2020-06-09 06:39

Question:

My app downloads a file in UTF-8 format, which is too large to read using the NSString initWithContentsOfFile method. The problem I have is that the NSFileHandle readDataOfLength method reads a specified number of bytes, and I may end up only reading part of a UTF-8 character. What is the best solution here?

LATER:

Let it be recorded in the ship's log that the following code works:

    NSData *buf = [NSData dataWithContentsOfFile:path
                                         options:NSDataReadingMappedIfSafe
                                           error:nil];

    NSString *data = [[[NSString alloc]
                       initWithBytesNoCopy:(void *)buf.bytes
                       length:buf.length
                       encoding:NSUTF8StringEncoding
                       freeWhenDone:NO] autorelease];

My main problem was actually to do with the encoding, not the task of reading the file.

Answer 1:

You can use NSData +dataWithContentsOfFile:options:error: with the NSDataReadingMappedIfSafe option to map your file into memory rather than loading it. The virtual memory manager in iOS will then swap pieces of the file in and out of RAM, in the same way that a desktop OS handles its on-disk virtual memory file. You don't need enough RAM to keep the entire file in memory at once; the file just needs to be small enough to fit in the processor's address space (so, gigabytes). You'll get an object that acts exactly like a normal NSData, which should save you most of the hassle of using an NSFileHandle and manually streaming.

You'll probably then need to convert portions to NSString, since you can realistically expect that to convert from UTF-8 to another internal format (though it might not; it's worth having a go with -initWithData:encoding: and seeing whether NSString is smart enough just to keep a reference to the original data and expand from UTF-8 on demand), which I think is what your question is really getting at.

I'd suggest you use -initWithBytes:length:encoding: to convert a reasonable number of bytes to a string. You can then use -lengthOfBytesUsingEncoding: to find out how many bytes it actually made sense of and advance your read pointer appropriately. NSString should discard any partial character at the end of the bytes you provide, though if it instead returns nil for the truncated sequence you'll need to back off to a character boundary yourself.

EDIT: so, something like:

// map the file, rather than loading it
NSData *data = [NSData dataWithContentsOfFile:...whatever...
                         options:NSDataReadingMappedIfSafe
                         error:&youdDoSomethingSafeHere];

// we'll maintain a read pointer to our current location in the data
NSUInteger readPointer = 0;

// continue while data remains
while(readPointer < [data length])
{
    // work out how many bytes are remaining
    NSUInteger distanceToEndOfData = [data length] - readPointer;

    // grab at most 16kb of them, being careful not to read too many
    NSString *newPortion =
         [[NSString alloc] initWithBytes:(uint8_t *)[data bytes] + readPointer
                 length:distanceToEndOfData > 16384 ? 16384 : distanceToEndOfData
                 encoding:NSUTF8StringEncoding];

    // NSString returns nil if the bytes aren't valid UTF-8 (e.g. the
    // chunk ends mid-character); bail out rather than spin on a zero advance
    if(newPortion == nil) break;

    // do whatever we want with the string
    [self doSomethingWithFragment:newPortion];

    // advance our read pointer by the number of bytes actually read, and
    // clean up
    readPointer += [newPortion lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
    [newPortion release];
}

Of course, an implicit assumption is that all UTF-8 encodings are unique, which I have to admit I'm not knowledgeable enough to say for absolute certain.



Answer 2:

It's actually really easy to tell whether you have split a multibyte character in UTF-8. Continuation octets all have their two most significant bits set to 10, like this: 10xxxxxx. So if the last octet of the buffer has that pattern, scan backwards to find an octet that does not have that form. That is the first octet of the character, and the position of the most significant 0 in it tells you how many octets are in the character:

0xxxxxxx => 1 octet (ASCII)
110xxxxx => 2 octets
1110xxxx => 3 octets
11110xxx => 4 octets

(The original scheme extended to 5- and 6-octet forms, but RFC 3629 caps UTF-8 at 4 octets.)

So it's fairly trivial to figure out how many extra octets to read to get to a character boundary.
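The backwards scan described above can be sketched in C. This is a minimal helper under my own naming (`utf8_complete_prefix` is not from the answer), and it assumes the buffer holds valid UTF-8 that can only be truncated at its very end:

```c
#include <assert.h>
#include <stddef.h>

/* Return how many leading bytes of buf form only complete UTF-8
 * characters. Continuation octets match 10xxxxxx, i.e.
 * (byte & 0xC0) == 0x80. */
size_t utf8_complete_prefix(const unsigned char *buf, size_t len)
{
    if (len == 0)
        return 0;

    /* scan backwards past trailing continuation octets to the lead
     * octet of the final character */
    size_t i = len - 1;
    while (i > 0 && (buf[i] & 0xC0) == 0x80)
        i--;

    /* the position of the most significant 0 in the lead octet gives
     * the character's total length in octets */
    unsigned char lead = buf[i];
    size_t need;
    if      ((lead & 0x80) == 0x00) need = 1;   /* 0xxxxxxx */
    else if ((lead & 0xE0) == 0xC0) need = 2;   /* 110xxxxx */
    else if ((lead & 0xF0) == 0xE0) need = 3;   /* 1110xxxx */
    else if ((lead & 0xF8) == 0xF0) need = 4;   /* 11110xxx */
    else return i;                              /* stray continuation octet */

    /* keep everything if the final character is complete, otherwise
     * cut just before its lead octet */
    return (len - i >= need) ? len : i;
}
```

For example, given the two bytes `"h\xC3"` (an 'h' followed by the first octet of 'é') it returns 1, while the complete three bytes `"h\xC3\xA9"` give 3.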



Answer 3:

One approach would be to

  1. read up to a certain point,
  2. then examine the last byte(s) to determine whether they split a UTF-8 character,
  3. if not, read the next chunk,
  4. if so, read the extra byte(s) needed to complete the character, then read the next chunk.
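The steps above can be sketched in C, using the boundary test from answer 2. The names here are mine, not the answerer's, and valid UTF-8 is assumed: rather than reading extra bytes, this variant shrinks each piece so it never ends mid-character, which amounts to the same thing.

```c
#include <assert.h>
#include <stddef.h>

/* trim len back to the last complete-character boundary (see answer 2) */
static size_t complete_prefix(const unsigned char *buf, size_t len)
{
    if (len == 0) return 0;
    size_t i = len - 1;
    while (i > 0 && (buf[i] & 0xC0) == 0x80)     /* skip continuation octets */
        i--;
    unsigned char lead = buf[i];
    size_t need = (lead & 0x80) == 0x00 ? 1      /* 0xxxxxxx */
                : (lead & 0xE0) == 0xC0 ? 2      /* 110xxxxx */
                : (lead & 0xF0) == 0xE0 ? 3      /* 1110xxxx */
                : (lead & 0xF8) == 0xF0 ? 4 : 0; /* 11110xxx / invalid */
    return (need != 0 && len - i >= need) ? len : i;
}

/* Take at most `chunk` bytes at a time, shrinking any piece that would
 * end mid-character. Piece lengths are written to `sizes`; returns the
 * number of pieces. `chunk` should be >= 4 so a character always fits. */
size_t utf8_chunk_sizes(const unsigned char *data, size_t len,
                        size_t chunk, size_t *sizes)
{
    size_t pos = 0, n = 0;
    while (pos < len) {
        size_t take = len - pos < chunk ? len - pos : chunk;
        size_t keep = complete_prefix(data + pos, take);
        if (keep == 0)          /* chunk smaller than one character */
            keep = take;
        sizes[n++] = keep;
        pos += keep;
    }
    return n;
}

/* convenience check: does chunking `str` produce exactly `expect`? */
int chunks_equal(const char *str, size_t len, size_t chunk,
                 const size_t *expect, size_t n_expect)
{
    size_t sizes[64], n;
    n = utf8_chunk_sizes((const unsigned char *)str, len, chunk, sizes);
    if (n != n_expect) return 0;
    for (size_t i = 0; i < n; i++)
        if (sizes[i] != expect[i]) return 0;
    return 1;
}
```

Chunking "héllo" (six bytes, since 'é' is two) with `chunk = 2` gives pieces of 1, 2, 2 and 1 bytes: the first piece holds back the 'é' rather than splitting it.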


Answer 4:

UTF-8 is self-synchronizing: just read a little more or less as needed, then inspect the byte values to find the boundaries of any code point.

Also, you could use fopen and a small, manageable buffer on the stack for this, and memory will not be an issue.
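The fopen route can be sketched in C as follows. This is my own code, not the answerer's; it assumes a seekable stream of valid UTF-8, reads through a small stack buffer, and whenever a read ends mid-character it seeks back over the partial octets so the next read starts on a boundary.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* trim len back to the last complete-character boundary (see answer 2) */
static size_t complete_prefix(const unsigned char *buf, size_t len)
{
    if (len == 0) return 0;
    size_t i = len - 1;
    while (i > 0 && (buf[i] & 0xC0) == 0x80)     /* skip continuation octets */
        i--;
    unsigned char lead = buf[i];
    size_t need = (lead & 0x80) == 0x00 ? 1
                : (lead & 0xE0) == 0xC0 ? 2
                : (lead & 0xF0) == 0xE0 ? 3
                : (lead & 0xF8) == 0xF0 ? 4 : 0;
    return (need != 0 && len - i >= need) ? len : i;
}

/* Read fp in pieces of at most `piece` bytes (piece <= 32 here), seeking
 * back over any split character. Appends the bytes to `out` and returns
 * the total; `out` must be large enough, and gets NUL-terminated. */
size_t read_utf8_pieces(FILE *fp, size_t piece, char *out)
{
    unsigned char buf[32];          /* small, manageable stack buffer */
    size_t total = 0, n;
    while ((n = fread(buf, 1, piece, fp)) > 0) {
        size_t keep = complete_prefix(buf, n);
        if (keep == 0)              /* piece smaller than one character */
            keep = n;
        memcpy(out + total, buf, keep);
        total += keep;
        if (keep < n)               /* give the partial character back */
            fseek(fp, (long)keep - (long)n, SEEK_CUR);
    }
    out[total] = '\0';
    return total;
}

/* self-check: write text to a temp file, read it back in pieces, compare */
int roundtrip(const char *text, size_t piece)
{
    FILE *fp = tmpfile();
    if (fp == NULL) return 0;
    fwrite(text, 1, strlen(text), fp);
    rewind(fp);
    char out[256];
    size_t got = read_utf8_pieces(fp, piece, out);
    fclose(fp);
    return got == strlen(text) && strcmp(out, text) == 0;
}
```

The same idea scales to a larger buffer; the small sizes here just make the boundary case easy to exercise.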