I'm pretty much a beginner in programming, currently trying to build my first web scraper using JSoup. So far I am able to get the data that I want from a single page of my target site, but naturally I would like to somehow iterate over the entire site.
JSoup seems to offer some kind of traversor/visitor (what's the difference?) for that, yet I have absolutely no idea how to make it work. I know what trees and nodes are and I know the structure of my target site, but I don't know how to create (?) a traversor/visitor object (?) and let it run over my site. Could it be that there is some advanced Java/OO magic at work that I don't know of?
Unfortunately neither the Jsoup cookbook nor other threads seem to really cover the details, so if someone could nudge me in the right direction I'd be very thankful.
JSoup seems to offer some kind of traverser/visitor (what's the difference?)
The NodeTraversor efficiently iterates over every node under, and including, a specified root node. It does not use recursion, so even a very large DOM will not cause a StackOverflowError.
The NodeVisitor (NV) is the companion of NodeTraversor (NT). Each time NT enters a node, it calls the head method of the NV. Each time NT leaves a node, it calls the tail method of the NV.
NT is ready-made and provided to you by the Jsoup API. All you have to do is supply NT with an NV implementation.
Here is a real-life implementation of NodeVisitor, taken from the Elasticsearch source code:
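To make the head/tail calling order concrete, here is a minimal, library-free sketch of the same idea. Everything in it (SimpleNode, SimpleVisitor, traverse, trace, sampleTree) is a hypothetical stand-in written for illustration; it is not Jsoup's actual code, only a small model of a non-recursive walk that calls head on entering a node and tail on leaving it.

```java
import java.util.ArrayList;
import java.util.List;

public class TraversalSketch {

    /** Hypothetical stand-in for a DOM node; NOT Jsoup's Node class. */
    static final class SimpleNode {
        final String name;
        final List<SimpleNode> children = new ArrayList<>();
        SimpleNode parent;

        SimpleNode(String name) { this.name = name; }

        /** Appends a child and returns this node so calls can be chained. */
        SimpleNode add(SimpleNode child) {
            child.parent = this;
            children.add(child);
            return this;
        }

        /** Returns the next sibling, or null if there is none. */
        SimpleNode nextSibling() {
            if (parent == null) return null;
            int i = parent.children.indexOf(this);
            return i + 1 < parent.children.size() ? parent.children.get(i + 1) : null;
        }
    }

    /** Hypothetical stand-in for a visitor with enter/leave callbacks. */
    interface SimpleVisitor {
        void head(SimpleNode node, int depth); // called when a node is entered
        void tail(SimpleNode node, int depth); // called when a node is left
    }

    /** Depth-first walk using a loop and parent links instead of recursion. */
    static void traverse(SimpleVisitor visitor, SimpleNode root) {
        SimpleNode node = root;
        int depth = 0;
        while (node != null) {
            visitor.head(node, depth);
            if (!node.children.isEmpty()) {
                node = node.children.get(0); // descend to the first child
                depth++;
            } else {
                // Climb up while there is no sibling left to visit.
                while (node.nextSibling() == null && depth > 0) {
                    visitor.tail(node, depth);
                    node = node.parent;
                    depth--;
                }
                visitor.tail(node, depth);
                if (node == root) break;     // back at the start: done
                node = node.nextSibling();
            }
        }
    }

    /** Records the order of callbacks as opening and closing tags. */
    static String trace(SimpleNode root) {
        StringBuilder out = new StringBuilder();
        traverse(new SimpleVisitor() {
            @Override public void head(SimpleNode node, int depth) {
                out.append('<').append(node.name).append('>');
            }
            @Override public void tail(SimpleNode node, int depth) {
                out.append("</").append(node.name).append('>');
            }
        }, root);
        return out.toString();
    }

    /** Builds html > body > (h1, p). */
    static SimpleNode sampleTree() {
        return new SimpleNode("html")
                .add(new SimpleNode("body")
                        .add(new SimpleNode("h1"))
                        .add(new SimpleNode("p")));
    }

    public static void main(String[] args) {
        // prints <html><body><h1></h1><p></p></body></html>
        System.out.println(trace(sampleTree()));
    }
}
```

The opening tags mark head calls and the closing tags mark tail calls, so a parent's tail always fires after all of its children have been visited. With Jsoup itself you only write the visitor; the traversal loop is already provided.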
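// Imports needed by this excerpt (both members sit inside an enclosing class).
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

// Concatenates the text of all given elements, separated by single spaces.
protected static String convertElementsToText(Elements elements) {
    if (elements == null || elements.isEmpty())
        return "";
    StringBuilder buffer = new StringBuilder();
    // Newer jsoup releases replace this constructor-based usage with the
    // static call NodeTraversor.traverse(visitor, node).
    NodeTraversor nt = new NodeTraversor(new ToTextNodeVisitor(buffer));
    for (Element element : elements) {
        nt.traverse(element);
    }
    return buffer.toString().trim();
}

// Visitor that collects the text of every TextNode it encounters.
private static final class ToTextNodeVisitor implements NodeVisitor {
    final StringBuilder buffer;

    ToTextNodeVisitor(StringBuilder buffer) {
        this.buffer = buffer;
    }

    // Called when the traversor enters a node.
    @Override
    public void head(Node node, int depth) {
        if (node instanceof TextNode) {
            TextNode textNode = (TextNode) node;
            String text = textNode.text().replace('\u00A0', ' ').trim(); // non breaking space
            if (!text.isEmpty()) {
                buffer.append(text);
                if (!text.endsWith(" ")) {
                    buffer.append(" ");
                }
            }
        }
    }

    // Called when the traversor leaves a node; nothing to do here.
    @Override
    public void tail(Node node, int depth) {
    }
}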