I am trying to fetch an HTML page that declares a UTF-8 charset:
NSError *error = nil;
NSString *html = [NSString stringWithContentsOfURL:[NSURL URLWithString:@"http://forums.drom.ru/general/t1151288178.html"] encoding:NSUTF8StringEncoding error:&error];
but NSLog(@"%@", html) prints (null).
Why is this happening?
The problem is that while the file's meta tag claims the page is UTF-8, it is not (at least not entirely). You can confirm this as follows:
Download the HTML as NSData (which succeeds) and save it to a file:
NSError *error = nil;
NSURL *url = [NSURL URLWithString:@"http://forums.drom.ru/general/t1151288178.html"];
NSData *data = [NSData dataWithContentsOfURL:url options:0 error:&error];
NSString *docsPath = NSSearchPathForDirectoriesInDomains(NSDocumentDirectory, NSUserDomainMask, YES)[0];
NSString *filename = [docsPath stringByAppendingPathComponent:@"test.html"];
[data writeToFile:filename atomically:YES];
Run iconv from the Terminal command line on the saved file; it will report an error (including line number and character number):
iconv -f UTF-8 test.html > /dev/null
Thanks to Torsten Marek for sharing that with us.
When I look at that portion of the HTML, there are definitely non-UTF-8 characters there, buried in the assignment of the clever_cut_pattern JavaScript variable.
If we thought you had just got the encoding wrong, the typical counsel would be to use the variant of stringWithContentsOfURL that takes a usedEncoding parameter (i.e. rather than guessing what the encoding is, let NSString determine it for you):
NSStringEncoding encoding;
NSString *html = [NSString stringWithContentsOfURL:url usedEncoding:&encoding error:&error];
Unfortunately, in this case, even that fails (presumably because the file purports to be UTF8, but isn't).
The question then becomes "OK, so what do I do now?" It depends on why you were trying to download that HTML in your app in the first place. If you really need to convert this to UTF-8 (i.e. strip out the non-UTF-8 characters), you could theoretically use the GNU iconv(3) function, which is part of the libiconv library. That could identify the non-conforming characters, which you could then presumably remove. It's a question of how much work you're willing to do to handle this non-conforming web page.
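To make that last idea concrete, here is a minimal sketch in plain C (which also compiles inside an Objective-C file). It assumes that simply dropping the offending bytes is acceptable for your use case. The helper name strip_invalid_utf8 is hypothetical, not part of libiconv; only iconv_open, iconv and iconv_close are real library calls. Depending on the platform you may need to link against libiconv explicitly.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not part of libiconv): copies `in` to `out`,
 * dropping every byte that iconv rejects as invalid UTF-8.
 * `out` must have room for at least `in_len` bytes.
 * Returns the number of bytes written, or (size_t)-1 if the
 * converter could not be opened. */
size_t strip_invalid_utf8(const char *in, size_t in_len, char *out)
{
    /* Converting UTF-8 to UTF-8 makes iconv act as a validator. */
    iconv_t cd = iconv_open("UTF-8", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *src = (char *)in;      /* iconv's prototype takes char **, not const char ** */
    size_t src_left = in_len;
    char *dst = out;
    size_t dst_left = in_len;    /* output never exceeds input when only dropping bytes */

    while (src_left > 0) {
        if (iconv(cd, &src, &src_left, &dst, &dst_left) != (size_t)-1)
            break;               /* everything that remained was valid */
        if (errno == EILSEQ) {
            /* src points at the invalid byte: skip it and carry on. */
            src++;
            src_left--;
        } else {
            /* EINVAL: truncated sequence at the end of the input;
             * E2BIG should not occur given the buffer size. Stop either way. */
            break;
        }
    }

    iconv_close(cd);
    return (size_t)(dst - out);
}
```

In the app, you would presumably pass data.bytes and data.length from the NSData downloaded earlier, then build the string from the cleaned buffer with initWithBytes:length:encoding: and NSUTF8StringEncoding. Note that this discards the bad bytes rather than trying to guess what they were meant to be.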