I'm using Xerces to parse my XML document. The issue is that XML-escaped characters like ' ' arrive in the characters() method already unescaped. I need to receive the escaped characters inside characters() as is.
Thanks.
UPD: I tried overriding the resolveEntity method in my DefaultHandler descendant. I can see in the debugger that it is set as the XML reader's entity resolver, but the code in the overridden method is never invoked.
If you supply a LexicalHandler as a callback to the SAX parser, it will inform you of the start and end of every entity reference using startEntity() and endEntity() callbacks.
(Note that the JavaDoc at http://download.oracle.com/javase/1.5.0/docs/api/org/xml/sax/ext/LexicalHandler.html talks of "entities" when the correct term is "entity references").
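As a rough illustration of the LexicalHandler callbacks described above, here is a minimal sketch that only records which entity references the parser reports. The class name EntityEventLogger and the method entityEvents are made up for this example; the property id is the standard SAX2 lexical-handler property. It assumes the entity is declared in the document's DTD.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper class, not part of the original answer.
class EntityEventLogger {
    // Standard SAX2 property id used to register a LexicalHandler.
    static final String LEXICAL_HANDLER = "http://xml.org/sax/properties/lexical-handler";

    /** Parses the XML and returns the entity start/end events in document order. */
    static List<String> entityEvents(String xml) {
        final List<String> events = new ArrayList<>();
        // DefaultHandler2 implements both ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startEntity(String name) {
                events.add("start " + name);
            }

            @Override
            public void endEntity(String name) {
                events.add("end " + name);
            }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            // Without this property the parser never calls startEntity/endEntity.
            reader.setProperty(LEXICAL_HANDLER, handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return events;
    }
}
```

Calling entityEvents on a document that declares an entity foo in its internal DTD subset and uses &foo; once in content should yield one start event and one end event for foo.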
Note also that there is no way to get a SAX parser to tell you about numeric character references such as &#x1234;. Applications are supposed to treat these in exactly the same way as the original character, so you really shouldn't be interested in them.

I think your solution is not too bad: a few lines of code to do exactly what you want. The problem is that
the startEntity and endEntity methods are not provided by the ContentHandler interface, so you have to write a LexicalHandler which works in combination with your ContentHandler. Usually an XMLFilter is more elegant, but since you have to work with entities, you still need to write a LexicalHandler. Take a look here for an introduction to the use of SAX filters.

I'd like to show you a way, very similar to yours, which allows you to separate filtering operations (wrapping & to
&amp; for instance) from output operations (or something else). I've written my own XMLFilter based on XMLFilterImpl which also implements the LexicalHandler interface. This filter contains only the code related to escaping/unescaping entities.

And this is my main, with a DefaultHandler as the ContentHandler, which receives the entity as it is according to the filter code:

And this is my output:
Probably you don't like it, anyway this is an alternative solution.
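A minimal sketch of the kind of filter described in this answer, under stated assumptions: the class names EntityPreservingFilter and FilterDemo are invented for illustration, and the filter simply re-emits each entity reference as literal &name; text while suppressing the expanded replacement text. It is not the answer author's code, only one way the described design could look.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: keeps entity handling separate from the output handler.
class EntityPreservingFilter extends XMLFilterImpl implements LexicalHandler {
    private int entityDepth = 0; // > 0 while inside an entity's replacement text

    EntityPreservingFilter(XMLReader parent) throws SAXException {
        super(parent);
        // Register this filter as the parent's LexicalHandler (standard SAX2 property).
        parent.setProperty("http://xml.org/sax/properties/lexical-handler", this);
    }

    @Override
    public void startEntity(String name) throws SAXException {
        if (isDtdEvent(name)) return;
        if (entityDepth++ == 0) {
            // Pass the reference itself downstream instead of its expansion.
            String ref = "&" + name + ";";
            super.characters(ref.toCharArray(), 0, ref.length());
        }
    }

    @Override
    public void endEntity(String name) {
        if (!isDtdEvent(name)) entityDepth--;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Drop the expanded text reported between startEntity and endEntity.
        if (entityDepth == 0) super.characters(ch, start, length);
    }

    // Parameter entities ("%name") and the external subset ("[dtd]") are not content.
    private static boolean isDtdEvent(String name) {
        return name.startsWith("%") || name.equals("[dtd]");
    }

    // Remaining LexicalHandler callbacks are not needed here.
    @Override public void startDTD(String name, String publicId, String systemId) { }
    @Override public void endDTD() { }
    @Override public void startCDATA() { }
    @Override public void endCDATA() { }
    @Override public void comment(char[] ch, int start, int length) { }
}

class FilterDemo {
    /** Returns the character data as seen by a plain DefaultHandler behind the filter. */
    static String run(String xml) {
        try {
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            EntityPreservingFilter filter = new EntityPreservingFilter(parser);
            StringBuilder text = new StringBuilder();
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
            return text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

The downstream DefaultHandler stays unaware of entities: it just receives characters, which is the separation of filtering from output that the answer argues for. Note that the predefined entities such as &amp; may not be reported through startEntity by default, depending on the parser.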
I'm sorry, but with a SAX parser I don't think you have a more elegant way.

You should also consider switching to a StAX parser: it's very easy to do what you want with XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES set to false. If you like this solution, you should take a look here.

The temporary solution:
But still need elegant solution.
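For the StAX suggestion made in the earlier answer, a minimal sketch might look like the following. The class name StaxEntityDemo and the method textWithEntityRefs are invented for this example, and the behaviour shown assumes the JDK's built-in StAX implementation and an entity declared in the DTD; other implementations may report entity references differently.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not taken from the thread.
class StaxEntityDemo {
    /** Collects text content, writing unexpanded entity references back as &name;. */
    static String textWithEntityRefs(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Ask the reader to report entity references instead of expanding them.
            factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            StringBuilder out = new StringBuilder();
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.ENTITY_REFERENCE) {
                    // For this event type getLocalName() returns the entity name.
                    out.append('&').append(reader.getLocalName()).append(';');
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    out.append(reader.getText());
                }
            }
            reader.close();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

As with SAX, this applies to named entities declared in the DTD; numeric character references and the predefined entities are typically still delivered as plain characters.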
There is one more way: the escapeXml method of the org.apache.commons.lang.StringEscapeUtils class. Try this code in your characters(char[] ch, int start, int length) method:

You may download the jar here.
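To show the idea without requiring the Commons Lang jar, here is a sketch that re-escapes text inside characters() using a small local helper. The class ReEscapingHandler and its escapeXml method are stand-ins written for this example, covering only the five predefined XML entities; with Commons Lang on the classpath, the call would be StringEscapeUtils.escapeXml instead.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler; escapeXml below is a local stand-in, not the Commons Lang method.
class ReEscapingHandler extends DefaultHandler {
    final StringBuilder out = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser has already unescaped the text, so escape it again on the way out.
        out.append(escapeXml(new String(ch, start, length)));
    }

    /** Escapes the five characters that have predefined XML entities. */
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Re-escaping works on the already-parsed characters, so it cannot tell whether the source document used an entity, a numeric character reference, or the literal character; it only guarantees that the output is escaped.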