Is there a way to use StAX and JAX-B to create an index and then get quick access to an XML file?
I have a large XML file and I need to find information in it. This is used in a desktop application, so it should work on systems with little RAM.
So my idea is this: Create an index and then quickly access data from the large file.
I can't just split the file because it's an official federal database that I want to use unaltered.
Using an XMLStreamReader I can quickly find an element and then use JAXB to unmarshal it:
    final XMLInputFactory xf = XMLInputFactory.newInstance();
    final XMLStreamReader r = xf.createXMLStreamReader(filename, new FileInputStream(filename));
    final JAXBContext ucontext = JAXBContext.newInstance(Foo.class);
    final Unmarshaller unmarshaller = ucontext.createUnmarshaller();
    r.nextTag();
    while (r.hasNext()) {
        final int eventType = r.next();
        if (eventType == XMLStreamConstants.START_ELEMENT && r.getLocalName().equals("foo")
                && Long.parseLong(r.getAttributeValue(null, "bla")) == bla
        ) {
            // JAX-B works just fine:
            final JAXBElement<Foo> foo = unmarshaller.unmarshal(r, Foo.class);
            System.out.println(foo.getValue().getName());
            // But how do I get the offset?
            // cache.put(r.getAttributeValue(null, "id"), r.getCursor()); // ???
            break;
        }
    }
But I can't get the offset. I'd like to use this to prepare an index:
(id of element) -> (offset in file)
Then I should be able to use the offset to unmarshal from there: open a file stream, skip that many bytes, unmarshal. I can't find a library that does this, and I can't do it on my own without knowing the position of the file cursor. The Javadoc clearly states that there is a cursor, but I can't find a way of accessing it.
Edit:
I'm just trying to offer a solution that will work on old hardware so people can actually use it. Not everyone can afford a new and powerful computer. Using StAX I can get the data in about 2 seconds, which is a bit long. But it doesn't require RAM. It requires 300 MB of RAM to just use JAX-B. Using some embedded db system would just be a lot of overhead for such a simple task. I'll use JAX-B anyway. Anything else would be useless for me since the wsimport-generated classes are already perfect. I just don't want to load 300 MB of objects when I only need a few.
I can't find a DB that just needs an XSD to create an in-memory DB, which doesn't use that much RAM. It's all made for servers or it's required to define a schema and map the XML. So I assume it just doesn't exist.
You could work with a generated XML parser using ANTLR4.
The following works very well on a ~17 GB Wikipedia dump
/20170501/dewiki-20170501-pages-articles-multistream.xml.bz2
but I had to increase the heap size using -Xmx6g.
1. Get XML Grammar
2. Generate Parser
3. Copy Generated Java files to your Project
4. Hook in with a Listener to collect character offsets
5. Result
Prints:
Offsets: [2441, 10854, 30257, 51419 ....
6. Read from Offset Position
To test the code I've written a class that reads each Wikipedia page into a Java object
using basically this code
The complete example is on GitHub.
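For step 6, reading from a stored offset could look roughly like the sketch below. It is not the code from the repository, only a minimal illustration using the StAX classes that ship with the JDK. The class name `OffsetReadSketch`, the helpers `findOffsets` and `readTitleAt`, and the tiny sample document are all hypothetical; `findOffsets` is a plain string search that stands in for the offsets collected by the ANTLR listener. The sketch assumes character offsets (not byte offsets) and assumes the element at the offset does not rely on namespace prefixes declared on an ancestor.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class OffsetReadSketch {

    // Hypothetical stand-in for the listener-built index:
    // records the character offset of every occurrence of startTag.
    static List<Integer> findOffsets(String xml, String startTag) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = xml.indexOf(startTag); i >= 0; i = xml.indexOf(startTag, i + 1)) {
            offsets.add(i);
        }
        return offsets;
    }

    // Skips `offset` characters, then parses only the element that starts there
    // and returns the text of its first <title> descendant (or null if none).
    static String readTitleAt(Reader in, long offset) throws IOException, XMLStreamException {
        long remaining = offset;
        while (remaining > 0) {
            long skipped = in.skip(remaining); // skip() may skip fewer chars than requested
            if (skipped <= 0) {
                throw new IOException("offset beyond end of input");
            }
            remaining -= skipped;
        }
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            int depth = 0;
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    if ("title".equals(r.getLocalName())) {
                        return r.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    // Stop at the end of the element we started on, so the parser
                    // never sees the sibling elements that follow it.
                    if (--depth == 0) {
                        break;
                    }
                }
            }
            return null;
        } finally {
            r.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki><page><title>A</title></page>"
                + "<page><title>B</title></page></mediawiki>";
        List<Integer> offsets = findOffsets(xml, "<page>");
        System.out.println("Offsets: " + offsets);
        System.out.println(readTitleAt(new StringReader(xml), offsets.get(1)));
    }
}
```

For a real file, the `StringReader` would be replaced by a reader over the file in the correct encoding, and the same `XMLStreamReader` could be handed to a JAXB `Unmarshaller` instead of reading the title by hand. Stopping at the matching end tag matters: if parsing continued, the parser would hit the next sibling and report the input as not well-formed.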