iOS utf-8 encoding issue

Posted 2019-08-15 04:41

Question:

I am trying to fetch an HTML page that declares a UTF-8 charset:

NSError *error = nil;
NSString *html = [NSString stringWithContentsOfURL:[NSURL URLWithString:@"http://forums.drom.ru/general/t1151288178.html"] encoding:NSUTF8StringEncoding error:&error];

but NSLog(@"%@", html) prints (null). Why is this happening?

Answer 1:

The problem is that while the page's meta tag claims the content is UTF-8, it's not (at least not entirely). You can confirm this as follows:

  • Download the HTML as NSData (which succeeds) and save it to a file:

    NSError *error = nil;
    NSURL *url = [NSURL URLWithString:@"http://forums.drom.ru/general/t1151288178.html"];
    NSData *data = [NSData dataWithContentsOfURL:url options:0 error:&error];
    NSString *docsPath = NSSearchPathForDirectoriesInDomains(NSDocumentDirectory, NSUserDomainMask, YES)[0];
    NSString *filename = [docsPath stringByAppendingPathComponent:@"test.html"];
    [data writeToFile:filename atomically:YES];
    
  • Run iconv on that file from the Terminal command line; it will report an error (including line number and character number):

    iconv -f UTF-8 test.html > /dev/null
    

    Thanks to Torsten Marek for sharing that with us.

When I look at that portion of the HTML, there are definitely non-UTF-8 bytes there, buried in the assignment of the clever_cut_pattern JavaScript variable.
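If you would rather locate the offending bytes inside the app than via Terminal, a check along the following lines could be run over the downloaded bytes. This is only a sketch in plain C (which compiles as-is in an Objective-C file); the function name first_invalid_utf8 is made up for illustration, and it approximates, rather than reproduces, what iconv reports.

```c
#include <stddef.h>

/* Hypothetical helper: returns the offset of the first byte that cannot be
   part of well-formed UTF-8, or -1 if the whole buffer is valid. */
long first_invalid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t need;        /* continuation bytes expected after the lead byte */
        unsigned long min;  /* smallest code point allowed for this length */
        unsigned long cp;   /* code point being decoded */

        if (c < 0x80) { i++; continue; }                 /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { need = 1; min = 0x80;    cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { need = 2; min = 0x800;   cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { need = 3; min = 0x10000; cp = c & 0x07; }
        else return (long)i;   /* stray continuation byte or invalid lead byte */

        if (len - i - 1 < need) return (long)i;          /* truncated sequence */
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return (long)i;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        /* Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (long)i;
        i += need + 1;
    }
    return -1;
}
```

With the NSData from the download step, this would be called as first_invalid_utf8(data.bytes, data.length); a non-negative result is the byte offset to inspect.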

If we thought you had simply picked the wrong encoding, the usual advice would be to use the variant of stringWithContentsOfURL that takes a usedEncoding parameter (i.e. rather than guessing what the encoding is, let NSString determine it for you):

NSStringEncoding encoding;
NSString *html = [NSString stringWithContentsOfURL:url usedEncoding:&encoding error:&error];

Unfortunately, in this case, even that fails (presumably because the file purports to be UTF8, but isn't).

The question then becomes "OK, so what do I do now?" That depends on why you were trying to download that HTML in your app in the first place. If you really need to convert this to UTF-8 (i.e. strip out the non-UTF-8 bytes), you could theoretically use the iconv(3) function, which is part of the GNU libiconv library. It can identify the non-conforming bytes, which you could then remove. It's a question of how much work you're willing to do to handle this non-conforming web page.
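As a rough illustration of that iconv(3) approach, the sketch below converts UTF-8 to UTF-8 and skips one input byte whenever iconv stops with EILSEQ. The function name strip_invalid_utf8 is hypothetical. It assumes the iconv implementation validates its input during a UTF-8 to UTF-8 conversion and leaves the input pointer on the offending byte, and that on iOS you link against libiconv; treat it as a starting point, not a tested solution for this page.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: copies `len` bytes from `in` to `out`, dropping every
   byte that iconv(3) rejects as invalid UTF-8. `out` must have room for at
   least `len` bytes. Returns the number of bytes written, or (size_t)-1 if
   the converter cannot be opened. */
size_t strip_invalid_utf8(const char *in, size_t len, char *out)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *src = (char *)in;   /* iconv takes char **, but does not modify input */
    size_t src_left = len;
    char *dst = out;
    size_t dst_left = len;

    while (src_left > 0) {
        if (iconv(cd, &src, &src_left, &dst, &dst_left) != (size_t)-1)
            break;                 /* everything that remained was valid */
        if (errno == EILSEQ) {     /* invalid byte: drop it and carry on */
            src++;
            src_left--;
        } else {
            break;                 /* EINVAL (truncated tail) or E2BIG: stop */
        }
    }

    iconv_close(cd);
    return (size_t)(dst - out);
}
```

The cleaned bytes could then be handed to NSString's initWithBytes:length:encoding: with NSUTF8StringEncoding. Bear in mind that dropping bytes loses whatever text they encoded, so this only makes sense if that portion of the page does not matter to you.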