While fetching a UTF-8-encoded file over the network using the NSURLConnection class, there's a good chance the delegate's connection:didReceiveData: message will be sent with an NSData that cuts the UTF-8 stream off mid-character, because UTF-8 is a multi-byte encoding scheme and the bytes of a single character can arrive split across two separate NSData objects. In other words, if I join all the data I get from connection:didReceiveData:, I will have a valid UTF-8 file, but each separate piece of data is not necessarily valid UTF-8.
I do not want to keep the whole downloaded file in memory.
What I want is: given an NSData, decode whatever you can into an NSString. In case the last few bytes of the NSData are an incomplete multi-byte sequence, tell me, so I can save them for the next NSData.
One obvious solution is to repeatedly try decoding with initWithData:encoding:, truncating the last byte each time, until it succeeds. This, unfortunately, can be very wasteful.
UTF-8 is a pretty simple encoding to parse and was designed to make it easy to detect incomplete sequences and, if you start in the middle of an incomplete sequence, to find its beginning.
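Search backward from the end for a byte that's either <= 0x7f or >= 0xc0 (bytes from 0x80 to 0xbf are continuation bytes, so skip over them). If it's <= 0x7f, it's complete. If it's between 0xc0 and 0xdf, inclusive, it requires one following byte to be complete. If it's between 0xe0 and 0xef, it requires two following bytes to be complete. If it's >= 0xf0, it requires three following bytes to be complete. If fewer bytes than that follow it, the sequence is incomplete: decode everything before that byte and save the rest for the next NSData.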
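For what it's worth, here is a minimal sketch of that backward scan in plain C, so it can be dropped into an Objective-C file and called with the NSData's bytes and length. The function name utf8_complete_prefix is made up for illustration, and the sketch assumes the stream is otherwise well-formed UTF-8; it only looks at the tail and does no real validation.

```c
#include <stddef.h>

/* Returns how many leading bytes of buf can be decoded now.
   The remaining (len - result) bytes are an unfinished trailing
   sequence to hold back and prepend to the next chunk.
   Assumption: the data is otherwise valid UTF-8. */
size_t utf8_complete_prefix(const unsigned char *buf, size_t len)
{
    size_t i = len;
    size_t back = 0;

    /* Step back over at most 3 continuation bytes (10xxxxxx). */
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0) {
        /* No lead byte within reach: nothing we can sensibly hold back. */
        return len;
    }

    unsigned char lead = buf[i - 1];
    size_t need; /* continuation bytes this lead byte requires */

    if (lead <= 0x7F)      need = 0;
    else if (lead >= 0xF0) need = 3;
    else if (lead >= 0xE0) need = 2;
    else if (lead >= 0xC0) need = 1;
    else                   return len; /* still a continuation byte: malformed, leave it to the decoder */

    size_t have = len - i; /* continuation bytes actually present after the lead byte */
    return (have < need) ? i - 1 : len;
}
```

In connection:didReceiveData: you would presumably append the new data to whatever bytes were held back last time, call this on the combined buffer, decode the first n bytes with something like initWithBytes:length:encoding:, and keep the rest for the next call. At most three bytes are ever carried over.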
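If you want to make sure that you don't stop in the middle of a UTF-8 multi-byte sequence, you're going to need to look at the bytes at the end of the byte array and check the top 2 bits of each: 10 marks a continuation byte, 11 marks the first byte of a multi-byte sequence, and a top bit of 0 marks a single-byte (ASCII) character.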
Look at the multi-byte table in the Wikipedia entry: http://en.wikipedia.org/wiki/UTF-8
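As a rough illustration of the top-bits check, the three cases can be written as bit masks in C. The helper names below are hypothetical, not from any library.

```c
#include <stdbool.h>

/* 0xxxxxxx: a single-byte (ASCII) character, complete on its own. */
static bool utf8_is_ascii(unsigned char b)
{
    return (b & 0x80) == 0x00;
}

/* 10xxxxxx: a continuation byte, somewhere inside a multi-byte sequence. */
static bool utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* 11xxxxxx: the first byte of a multi-byte sequence. */
static bool utf8_is_lead(unsigned char b)
{
    return (b & 0xC0) == 0xC0;
}
```

If the last byte of the array is a lead byte, or a continuation byte whose sequence is not yet finished, the array stops mid-character.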
I have a similar problem - partly decoding utf8
before
after [solved]