Currently im trying to use a SAX Parser but about 3/4 through the file it just completely freezes up, i have tried allocating more memory etc but not getting any improvements.
Is there any way to speed this up? A better method?
Stripped it to bare bones, so i now have the following code and when running in command line it still doesn't go as fast as i would like.
Running it with "java -Xms-4096m -Xmx8192m -jar reader.jar" i get a GC overhead limit exceeded around article 700000
Main:
public class Read {
public static void main(String[] args) {
pages = XMLManager.getPages();
}
}
XMLManager
public class XMLManager {
public static ArrayList<Page> getPages() {
ArrayList<Page> pages = null;
SAXParserFactory factory = SAXParserFactory.newInstance();
try {
SAXParser parser = factory.newSAXParser();
File file = new File("..\\enwiki-20140811-pages-articles.xml");
PageHandler pageHandler = new PageHandler();
parser.parse(file, pageHandler);
pages = pageHandler.getPages();
} catch (ParserConfigurationException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
return pages;
}
}
PageHandler
public class PageHandler extends DefaultHandler{
private ArrayList<Page> pages = new ArrayList<>();
private Page page;
private StringBuilder stringBuilder;
private boolean idSet = false;
public PageHandler(){
super();
}
@Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
stringBuilder = new StringBuilder();
if (qName.equals("page")){
page = new Page();
idSet = false;
} else if (qName.equals("redirect")){
if (page != null){
page.setRedirecting(true);
}
}
}
@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
if (page != null && !page.isRedirecting()){
if (qName.equals("title")){
page.setTitle(stringBuilder.toString());
} else if (qName.equals("id")){
if (!idSet){
page.setId(Integer.parseInt(stringBuilder.toString()));
idSet = true;
}
} else if (qName.equals("text")){
String articleText = stringBuilder.toString();
articleText = articleText.replaceAll("(?s)<ref(.+?)</ref>", " "); //remove references
articleText = articleText.replaceAll("(?s)\\{\\{(.+?)\\}\\}", " "); //remove links underneath headings
articleText = articleText.replaceAll("(?s)==See also==.+", " "); //remove everything after see also
articleText = articleText.replaceAll("\\|", " "); //Separate multiple links
articleText = articleText.replaceAll("\\n", " "); //remove new lines
articleText = articleText.replaceAll("[^a-zA-Z0-9- \\s]", " "); //remove all non alphanumeric except dashes and spaces
articleText = articleText.trim().replaceAll(" +", " "); //convert all multiple spaces to 1 space
Pattern pattern = Pattern.compile("([\\S]+\\s*){1,75}"); //get first 75 words of text
Matcher matcher = pattern.matcher(articleText);
matcher.find();
try {
page.setSummaryText(matcher.group());
} catch (IllegalStateException se){
page.setSummaryText("None");
}
page.setText(articleText);
} else if (qName.equals("page")){
pages.add(page);
page = null;
}
} else {
page = null;
}
}
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
stringBuilder.append(ch,start, length);
}
public ArrayList<Page> getPages() {
return pages;
}
}
Your parsing code is likely working fine, but the volume of data you're loading is probably just too large to hold in memory in that ArrayList
.
You need some sort of pipeline to pass the data on to its actual destination without ever
store it all in memory at once.
What I've sometimes done for this sort of situation is similar to the following.
Create an interface for processing a single element:
public interface PageProcessor {
void process(Page page);
}
Supply an implementation of this to the PageHandler
through a constructor:
public class Read {
public static void main(String[] args) {
XMLManager.load(new PageProcessor() {
@Override
public void process(Page page) {
// Obviously you want to do something other than just printing,
// but I don't know what that is...
System.out.println(page);
}
}) ;
}
}
public class XMLManager {
public static void load(PageProcessor processor) {
SAXParserFactory factory = SAXParserFactory.newInstance();
try {
SAXParser parser = factory.newSAXParser();
File file = new File("pages-articles.xml");
PageHandler pageHandler = new PageHandler(processor);
parser.parse(file, pageHandler);
} catch (ParserConfigurationException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
Send data to this processor instead of putting it in the list:
public class PageHandler extends DefaultHandler {
private final PageProcessor processor;
private Page page;
private StringBuilder stringBuilder;
private boolean idSet = false;
public PageHandler(PageProcessor processor) {
this.processor = processor;
}
@Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
//Unchanged from your implementation
}
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
//Unchanged from your implementation
}
@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
// Elide code not needing change
} else if (qName.equals("page")){
processor.process(page);
page = null;
}
} else {
page = null;
}
}
}
Of course, you can make your interface handle chunks of multiple records rather than just one and have the PageHandler
collect pages locally in a smaller list and periodically send the list off for processing and clear the list.
Or (perhaps better) you could implement the PageProcessor
interface as defined here and build in logic there that buffers the data and sends it on for further handling in chunks.
Don Roby's approach is somewhat reminiscent to the approach I followed creating a code generator designed to solve this particular problem (an early version was conceived in 2008). Basically each complexType
has its Java POJO
equivalent and handlers for the particular type are activated when the context changes to that element. I used this approach for SEPA, transaction banking and for instance discogs (30GB). You can specify what elements you want to process at runtime, declaratively using a propeties file.
XML2J uses mapping of complexTypes
to Java POJOs on the one hand, but lets you specify events you want to listen on.
E.g.
account/@process = true
account/accounts/@process = true
account/accounts/@detach = true
The essence is in the third line. The detach makes sure individual accounts are not added to the accounts list. So it won't overflow.
class AccountType {
private List<AccountType> accounts = new ArrayList<>();
public void addAccount(AccountType tAccount) {
accounts.add(tAccount);
}
// etc.
};
In your code you need to implement the process method (by default the code generator generates an empty method:
class AccountsProcessor implements MessageProcessor {
static private Logger logger = LoggerFactory.getLogger(AccountsProcessor.class);
// assuming Spring data persistency here
final String path = new ClassPathResource("spring-config.xml").getPath();
ClassPathXmlApplicationContext context = new ClassPathXmlApplicationContext(path);
AccountsTypeRepo repo = context.getBean(AccountsTypeRepo.class);
@Override
public void process(XMLEvent evt, ComplexDataType data)
throws ProcessorException {
if (evt == XMLEvent.END) {
if( data instanceof AccountType) {
process((AccountType)data);
}
}
}
private void process(AccountType data) {
if (logger.isInfoEnabled()) {
// do some logging
}
repo.save(data);
}
}
Note that XMLEvent.END
marks the closing tag of an element. So, when you are processing it, it is complete. If you have to relate it (using a FK) to its parent object in the database, you could process the XMLEvent.BEGIN
for the parent, create a placeholder in the database and use its key to store with each of its children. In the final XMLEvent.END
you would then update the parent.
Note that the code generator generates everything you need. You just have to implement that method and of course the DB glue code.
There are samples to get you started. The code generator even generates your POM files, so you can immediately after generation build your project.
The default process method is like this:
@Override
public void process(XMLEvent evt, ComplexDataType data)
throws ProcessorException {
/*
* TODO Auto-generated method stub implement your own handling here.
* Use the runtime configuration file to determine which events are to be sent to the processor.
*/
if (evt == XMLEvent.END) {
data.print( ConsoleWriter.out );
}
}
Downloads:
- https://github.com/lolkedijkstra/xml2j-core
- https://github.com/lolkedijkstra/xml2j-gen
- https://sourceforge.net/projects/xml2j/
First mvn clean install
the core (it has to be in the local maven repo), then the generator. And don't forget to set up the environment variable XML2J_HOME
as per directions in the usermanual.