I'd like to parse some well-formed XML into a DOM, but I'd like know the offset of each node's tag in the original media.
For example, if I had an XML document with the content something like:
<html>
<body>
<div>text</div>
</body>
</html>
I'd like to know that the node starts at offset 13 in the original media, and (more importantly) that "text" starts at offset 18.
Is this possible with standard Java XML parsers? JAXB? If no solution is easily available, what type of changes are necessary along the parsing path to make this possible?
The SAX API provides a rather obscure mechanism for this - the org.xml.sax.Locator
interface. When you use the SAX API, you subclass DefaultHandler
and pass that to the SAX parse methods, and the SAX parser implementation is supposed to inject a Locator
into your DefaultHandler
via setDocumentLocator()
. As the parsing proceeds, the various callback methods on your ContentHandler
are invoked (e.g. startElement()
), at which point you can consult the Locator
to find out the parsing position (via getColumnNumber()
and getLineNumber()
)
Technically, this is optional functionality, but the javadoc says that implementations are "strongly encouraged" to provide it, so you can likely assume the SAX parser built into JavaSE will do it.
Of course, this does mean using the SAX API, which is noone's idea of fun, but I can't see a way of accessing this information using a higher-level API.
edit: Found this example.
Use the XML Streamreader and its getLocation() method to return location object. location.getCharacterOffset() gives the byte offset of current location.
import javax.xml.stream.Location;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
public class Runner {
public static void main(String argv[]) {
XMLInputFactory factory = XMLInputFactory.newInstance();
try{
XMLStreamReader streamReader = factory.createXMLStreamReader(
new FileReader("D:\\BigFile.xml"));
while(streamReader.hasNext()){
streamReader.next();
if(streamReader.getEventType() == XMLStreamReader.START_ELEMENT){
Location location = streamReader.getLocation();
System.out.println("byte location: " + location.getCharacterOffset());
}
}
} catch(Exception e){
e.printStackTrace();
}