When reading an NSString from a file I can use initWithContentsOfFile:usedEncoding:error: and it will guess the encoding of the file. When I create one from an NSData, though, my only option is initWithData:encoding:, where I have to pass the encoding explicitly. How can I reliably guess the encoding when I work with NSData instead of files?
In general, you can’t. However, you can quite reliably identify UTF-8 files – if a file is valid UTF-8, it’s not very likely that it’s supposed to be any other encoding (except if all the bytes are in the ASCII range, in which case any “extended ASCII” encoding, including UTF-8, will give you the same result). All Unicode encodings also have an optional BOM which identifies them. So a reasonable approach would be:
- Look for a valid BOM. If there is one, use the corresponding encoding.
- Otherwise, try to interpret the data as UTF-8. You can do this by calling [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding] and checking whether the result is non-nil.
- If that fails, fall back to a default 8-bit encoding, such as +[NSString defaultCStringEncoding] (which provides a locale-appropriate guess).
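The three steps above can be sketched in Swift roughly like this (the function name and the return shape are my own; the BOM table covers the common Unicode encodings, with the 4-byte UTF-32 marks checked before the 2-byte UTF-16 ones, since the UTF-32LE BOM starts with the UTF-16LE bytes):

```swift
import Foundation

// Sketch of the guess: BOM first, then UTF-8, then a locale default.
func guessedString(from data: Data) -> (String, String.Encoding)? {
    // 1. Look for a BOM (longer marks first, so UTF-32 wins over UTF-16).
    let boms: [([UInt8], String.Encoding)] = [
        ([0xEF, 0xBB, 0xBF], .utf8),
        ([0xFF, 0xFE, 0x00, 0x00], .utf32LittleEndian),
        ([0x00, 0x00, 0xFE, 0xFF], .utf32BigEndian),
        ([0xFF, 0xFE], .utf16LittleEndian),
        ([0xFE, 0xFF], .utf16BigEndian),
    ]
    for (bom, encoding) in boms where data.starts(with: bom) {
        if let s = String(data: data, encoding: encoding) {
            return (s, encoding)
        }
    }
    // 2. Try UTF-8; a non-nil result means the bytes are valid UTF-8.
    if let s = String(data: data, encoding: .utf8) {
        return (s, .utf8)
    }
    // 3. Fall back to a locale-appropriate 8-bit encoding.
    let fallback = String.Encoding(rawValue: NSString.defaultCStringEncoding)
    return String(data: data, encoding: fallback).map { ($0, fallback) }
}
```

Remember the caveat from above: step 3 is only a guess, and valid UTF-8 input that was meant to be some other extended-ASCII encoding will still be reported as UTF-8.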
You could try to improve the guess in the last step by trying several encodings and choosing the one with the fewest sequences of letters interrupted by junk, where “junk” is any character that is not a letter, space, or common punctuation mark. But that would significantly increase complexity without actually being reliable.
In short, to be able to handle all available encodings you need to do what TextEdit does: shunt the decision over to the user.
Oh, one more thing: as of 10.5, the encoding is often stored alongside a file in the undocumented com.apple.TextEncoding extended attribute. If you open a file with +[NSString stringWithContentsOfFile:] or similar, this attribute is used automatically when present.
In iOS 8 and OS X 10.10 there is a new API on NSString:
Objective-C
+ (NSStringEncoding)stringEncodingForData:(NSData *)data
encodingOptions:(NSDictionary *)opts
convertedString:(NSString **)string
usedLossyConversion:(BOOL *)usedLossyConversion;
Swift
open class func stringEncoding(for data: Data,
encodingOptions opts: [StringEncodingDetectionOptionsKey : Any]? = nil,
convertedString string: AutoreleasingUnsafeMutablePointer<NSString?>?,
usedLossyConversion: UnsafeMutablePointer<ObjCBool>?) -> UInt
Now you can let the framework do the guessing, and in my experience that works really well!
From the header (the method is not in the documentation at the moment, but it was officially mentioned in WWDC Session 204, page 270), the options dictionary can contain:
- an array of suggested string encodings (unless the “use only suggested encodings” option below is set, all string encodings are considered, but the ones in the array get higher preference; moreover, the order within the array matters: the first encoding is preferred over the second, and so on)
- an array of string encodings not to use (the string encodings in this list will not be considered at all)
- a boolean option indicating whether only the suggested string encodings are considered
- a boolean option indicating whether lossy is allowed
- an option that gives a specific string to substitute for mystery bytes
- the current user's language
- a boolean option indicating whether the data is generated by Windows
If a value in the dictionary has the wrong type (for example, the value of NSStringEncodingDetectionSuggestedEncodingsKey is not an array), an exception is thrown.
If a value is unknown (for example, an entry in the array of suggested string encodings is not a valid encoding), it is ignored.
Example (Swift):
var convertedString: NSString?
let encoding = NSString.stringEncoding(for: data, encodingOptions: nil, convertedString: &convertedString, usedLossyConversion: nil)
If you just want the decoded string and don't care about the detected encoding, you can drop the let encoding = part.
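Building on that example, here is a sketch that also passes detection options; sampleData is a stand-in for whatever bytes you actually have, and the Swift key names (.suggestedEncodingsKey, .allowLossyKey) are my reading of the header's option list:

```swift
import Foundation

// Sketch: constrain the guess to a couple of likely encodings.
let sampleData = Data("héllo wörld".utf8)

var converted: NSString?
var usedLossy: ObjCBool = false
let options: [StringEncodingDetectionOptionsKey: Any] = [
    // Prefer UTF-8, then Latin-1 (array order expresses preference).
    .suggestedEncodingsKey: [String.Encoding.utf8.rawValue,
                             String.Encoding.isoLatin1.rawValue],
    .allowLossyKey: false,
]
let raw = NSString.stringEncoding(for: sampleData,
                                  encodingOptions: options,
                                  convertedString: &converted,
                                  usedLossyConversion: &usedLossy)
// A return value of 0 means detection failed.
if raw != 0, let s = converted as String? {
    print("decoded as \(String.Encoding(rawValue: raw)): \(s)")
}
```

Passing the pointers for convertedString and usedLossyConversion is optional (they can be nil, as in the example above), but inspecting them saves a second decoding pass once you have the detected encoding.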