I think I have read every web page relating to this problem, but I still cannot find a solution, so here I am.
I have an HTML page which is not under my control, and I need to parse it from my iPhone application. Here is a sample of the page I'm talking about:
<HTML>
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
</HEAD>
<BODY>
<LI class="bye bye" rel="hello 1">
<H5 class="onlytext">
<A name="morning_part">morning</A>
</H5>
<DIV class="mydiv">
<SPAN class="myclass">something about you</SPAN>
<SPAN class="anotherclass">
<A href="http://www.google.it">Bye Bye &egrave; un saluto</A>
</SPAN>
</DIV>
</LI>
</BODY>
</HTML>
I'm using NSXMLParser and it goes well until it finds the &egrave; HTML entity. It calls foundCharacters: for "Bye Bye" and then calls resolveExternalEntityName:systemID: with an entityName of "egrave". In this method I just return the character "è" converted to NSData; foundCharacters: is called again, appending the string "è" to the previous "Bye Bye ", and then the parser raises the NSXMLParserUndeclaredEntityError error.
I have no DTD and I cannot change the HTML file I'm parsing. Do you have any ideas on this problem? Thanks in advance to all of you, Rob.
Update (12/03/2010). Following Griffo's suggestion, I ended up with something like this:
data = [self replaceHtmlEntities:data];
NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
[parser setDelegate:self];
[parser parse];
where replaceHtmlEntities:(NSData *) is something like this:
- (NSData *)replaceHtmlEntities:(NSData *)data {
    // The page declares charset=ISO-8859-1, so decode it as Latin-1.
    NSString *htmlCode = [[NSString alloc] initWithData:data encoding:NSISOLatin1StringEncoding];
    NSMutableString *temp = [NSMutableString stringWithString:htmlCode];
    // Replace each named HTML entity with the literal character it stands for.
    // Caution: &amp; is predefined in XML, and a bare & is not well-formed XML.
    [temp replaceOccurrencesOfString:@"&amp;" withString:@"&" options:NSLiteralSearch range:NSMakeRange(0, [temp length])];
    [temp replaceOccurrencesOfString:@"&nbsp;" withString:@" " options:NSLiteralSearch range:NSMakeRange(0, [temp length])];
    ...
    [temp replaceOccurrencesOfString:@"&Agrave;" withString:@"À" options:NSLiteralSearch range:NSMakeRange(0, [temp length])];
    // Re-encode with the same encoding before handing the data to the parser.
    NSData *finalData = [temp dataUsingEncoding:NSISOLatin1StringEncoding];
    return finalData;
}
But I am still looking for the best way to solve this problem. I will try TouchXML in the next few days, but I still think there should be a way to do this using the NSXMLParser API, so if you know how, feel free to write it here :)
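One variation on the replacement approach above, as a minimal sketch: instead of substituting literal characters, each named entity can be rewritten as a numeric character reference (for example &egrave; becomes &#232;), which an XML parser resolves without any DTD and which stays pure ASCII regardless of the document encoding. The function name ReplaceNamedEntities and the four-entry table are assumptions made for illustration; a real table would need every entity the page uses.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): rewrites a few named
// HTML entities as numeric character references such as &#232;.
// Predefined XML entities like &amp; are not in the table and stay untouched.
static NSString *ReplaceNamedEntities(NSString *html) {
    // Entity name -> Unicode code point (decimal). Deliberately tiny table.
    NSDictionary *codePoints = [NSDictionary dictionaryWithObjectsAndKeys:
        @"160", @"nbsp",
        @"192", @"Agrave",
        @"224", @"agrave",
        @"232", @"egrave",
        nil];

    NSMutableString *result = [NSMutableString stringWithString:html];
    for (NSString *name in codePoints) {
        NSString *entity  = [NSString stringWithFormat:@"&%@;", name];
        NSString *numeric = [NSString stringWithFormat:@"&#%@;",
                             [codePoints objectForKey:name]];
        [result replaceOccurrencesOfString:entity
                                withString:numeric
                                   options:NSLiteralSearch
                                     range:NSMakeRange(0, [result length])];
    }
    return result;
}
```

The result can be converted back to NSData with dataUsingEncoding: and passed to initWithData: exactly as in the code above.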
Since I've just started doing iOS development I've been searching for the same thing and found a related mailing list entry: http://www.mail-archive.com/cocoa-dev@lists.apple.com/msg17706.html
This is fairly similar to your original solution and also causes a parser error (NSXMLParserErrorDomain error 26), but it does continue parsing after that. The problem is, of course, that it's harder to tell real errors apart ;-)

A possibly less hacky solution is to replace the DTD with a locally modified one in which every external entity declaration is replaced with a local one.
This is how I do it:
First, find and replace the document DTD declaration with a local file. For example, replace this:
with this:
Download the DTD from the W3C URL and add it to your app bundle. You can find the path of the file with the following code:
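A minimal sketch of such a lookup, assuming the DTD was added to the bundle under the hypothetical name xhtml1-strict.dtd. The helper name LocalDTDURLString is also an assumption; it returns a file:// URL string that could be substituted into the DOCTYPE declaration, or nil if the resource is not in the bundle.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: locates a bundled copy of the DTD and returns its
// file:// URL as a string. "xhtml1-strict.dtd" is an assumed file name.
static NSString *LocalDTDURLString(void) {
    NSString *path = [[NSBundle mainBundle] pathForResource:@"xhtml1-strict"
                                                     ofType:@"dtd"];
    if (path == nil) {
        return nil; // The DTD was not added to the app bundle.
    }
    return [[NSURL fileURLWithPath:path] absoluteString];
}
```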
Open the DTD file and find any external entity references:
Replace each one with the content of the entity file (http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent in the above case).
After replacing all external references, NSXMLParser should handle the entities properly without needing to download every remote DTD or external entity file each time it parses an XML file.
After exploring several alternatives, it appears that NSXMLParser will not support entities other than the standard entities &lt;, &gt;, &apos;, &quot; and &amp;.

The code below fails, resulting in an NSXMLParserUndeclaredEntityError.

Attempts to declare the entities by prepending the HTML document with ENTITY declarations will pass; however, the expanded entities are not passed back to parser:foundCharacters: and the è and à characters are dropped.

In another experiment, I created a completely valid XML document with an internal DTD.

I implemented the parser:foundInternalEntityDeclarationWithName:value: delegate method, and it is clear that the parser is getting the entity data; however, parser:foundCharacters: is only called for the predefined entities.

I found a link to a tutorial on Using the SAX Interface of LibXML. The xmlSAXHandler that is used by NSXMLParser allows for a getEntity callback to be defined. After calling getEntity, the expansion of the entity is passed to the characters callback.

NSXMLParser is missing functionality here. What should happen is that the NSXMLParser or its delegate store the entity definitions and provide them to the xmlSAXHandler getEntity callback. This is clearly not happening. I will file a bug report.

In the meantime, the earlier answer of performing a string replacement is perfectly acceptable if your documents are small. Check out the SAX tutorial mentioned above along with the XMLPerformance sample app from Apple to see if implementing the libxml parser on your own is worthwhile.

This has been fun.
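For anyone who wants to repeat the internal-DTD experiment described above, here is a minimal sketch under stated assumptions: it compiles with ARC, and the class name EntityProbe, the function RunEntityExperiment and the sample document are all made up for illustration. It only records what the delegate receives; no particular output is claimed, since the point is to inspect what NSXMLParser actually reports on your SDK.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical delegate that records what NSXMLParser reports (assumes ARC).
@interface EntityProbe : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableString *text;             // all foundCharacters: data
@property (nonatomic, strong) NSMutableArray *declaredEntities;  // "name=value" strings
@property (nonatomic, strong) NSError *error;                    // last parse error, if any
@end

@implementation EntityProbe
- (instancetype)init {
    if ((self = [super init])) {
        _text = [NSMutableString string];
        _declaredEntities = [NSMutableArray array];
    }
    return self;
}
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    [self.text appendString:string];
}
- (void)parser:(NSXMLParser *)parser
    foundInternalEntityDeclarationWithName:(NSString *)name
                                     value:(NSString *)value {
    [self.declaredEntities addObject:
        [NSString stringWithFormat:@"%@=%@", name, value]];
}
- (void)parser:(NSXMLParser *)parser parseErrorOccurred:(NSError *)parseError {
    self.error = parseError;
}
@end

// Parses a small, well-formed document whose internal DTD declares egrave.
static EntityProbe *RunEntityExperiment(void) {
    NSString *xml =
        @"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        @"<!DOCTYPE doc [\n"
        @"  <!ENTITY egrave \"&#232;\">\n"
        @"]>\n"
        @"<doc>Bye Bye &egrave; un saluto</doc>";
    EntityProbe *probe = [[EntityProbe alloc] init];
    NSXMLParser *parser =
        [[NSXMLParser alloc] initWithData:[xml dataUsingEncoding:NSUTF8StringEncoding]];
    parser.delegate = probe;
    [parser parse];
    return probe;
}
```

After calling RunEntityExperiment(), comparing the probe's text with its declaredEntities shows whether the declared entity was seen by the parser and whether its expansion reached parser:foundCharacters:.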
I think you're going to run into another problem with this example, as it isn't valid XML, which is what NSXMLParser is looking for.
The exact problem in the above is that the tags META, LI, HTML and BODY aren't closed, so the parser looks all the way through the rest of the document for their closing tags.
The only way around this that I know of, if you can't change the HTML, is to mirror it with the closing tags inserted.
You could do a string replace within the data before you parse it with NSXMLParser. NSXMLParser is UTF-8 only as far as I know.
I would try using a different parser, like libxml2 - in theory I think that one should be able to handle poor HTML.
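As a rough sketch of the libxml2 route, its HTML parser (htmlReadMemory from libxml/HTMLparser.h) tolerates unclosed tags and knows the HTML entity names. The helper name TextContentOfHTML and the hard-coded "ISO-8859-1" encoding are assumptions based on the sample page, and the project needs to link against libxml2 with its headers on the search path.

```objective-c
#import <Foundation/Foundation.h>
#include <libxml/HTMLparser.h>
#include <libxml/tree.h>

// Hypothetical helper: parses possibly malformed HTML with libxml2's HTML
// parser and returns the concatenated text content of the document.
static NSString *TextContentOfHTML(NSData *html) {
    htmlDocPtr doc = htmlReadMemory((const char *)[html bytes],
                                    (int)[html length],
                                    NULL,           // no base URL
                                    "ISO-8859-1",   // assumed from the META tag
                                    HTML_PARSE_RECOVER |
                                    HTML_PARSE_NOERROR |
                                    HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        return nil;
    }
    xmlNodePtr root = xmlDocGetRootElement(doc);
    // xmlNodeGetContent returns a UTF-8 buffer that the caller must free.
    xmlChar *content = (root != NULL) ? xmlNodeGetContent(root) : NULL;
    NSString *result = (content != NULL)
        ? [NSString stringWithUTF8String:(const char *)content]
        : nil;
    if (content != NULL) {
        xmlFree(content);
    }
    xmlFreeDoc(doc);
    return result;
}
```

For real use you would walk the node tree (or run XPath queries) rather than flatten everything to text; the flattening here only keeps the sketch short.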