Well, I'm pretty much trying to figure out how to pull information from a webpage, and bring it into my program (in Java).
For example, if I know the exact page I want info from, for the sake of simplicity a Best Buy item page, how would I get the appropriate info I need off of that page? Like the title, price, description?
What would this process even be called? I have no idea where to even begin researching this.
Edit: Okay, I'm running a test with Jsoup (the one posted by BalusC), but I keep getting this error:
Exception in thread "main" java.lang.NoSuchMethodError: java.util.LinkedList.peekFirst()Ljava/lang/Object;
at org.jsoup.parser.TokenQueue.consumeWord(TokenQueue.java:209)
at org.jsoup.parser.Parser.parseStartTag(Parser.java:117)
at org.jsoup.parser.Parser.parse(Parser.java:76)
at org.jsoup.parser.Parser.parse(Parser.java:51)
at org.jsoup.Jsoup.parse(Jsoup.java:28)
at org.jsoup.Jsoup.parse(Jsoup.java:56)
at test.main(test.java:12)
I do have Apache Commons.
Use an HTML parser like Jsoup. It has my preference over the other HTML parsers available in Java since it supports jQuery-like CSS selectors. Also, its class representing a list of nodes, Elements, implements Iterable, so you can iterate over it in an enhanced for loop (no need to hassle with verbose Node and NodeList like classes in the average Java DOM parser). Here's a basic kick-off example (just put the latest Jsoup JAR file in the classpath):
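Here is a sketch of what such a kick-off example could look like; the URL and the CSS selectors (which target this very question's page) are assumptions about Stack Overflow's markup and may need adjusting:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class Test {

        public static void main(String[] args) throws Exception {
            // Fetch and parse the page of this very question (URL is illustrative).
            Document document = Jsoup.connect("https://stackoverflow.com/questions/2835505").get();

            // Pull the question title with a CSS selector (the selector is an
            // assumption about Stack Overflow's markup).
            String question = document.select("#question-header h1").text();
            System.out.println("Question: " + question);

            // Elements is Iterable, so an enhanced for loop works directly.
            Elements answerers = document.select("#answers .user-details a");
            for (Element answerer : answerers) {
                System.out.println("Answerer: " + answerer.text());
            }
        }
    }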
As you might have guessed, this prints your own question and the names of all answerers.
You'd probably want to look at the HTML to see if you can find strings that are unique and near your text; then you can use line/char offsets to get at the data. Could be awkward in Java if there aren't any XML classes similar to the ones found in System.XML.Linq in C#.
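A minimal sketch of that offset-based approach, assuming a hypothetical page where the price sits between known marker strings:

    public class OffsetScrape {

        public static void main(String[] args) {
            // Hypothetical HTML; on a real page you'd pick marker strings that are
            // unique and sit right next to the data you want.
            String html = "<html><body><span class=\"price\">$199.99</span></body></html>";

            String startMarker = "<span class=\"price\">";
            String endMarker = "</span>";

            int start = html.indexOf(startMarker);
            if (start >= 0) {
                start += startMarker.length();
                int end = html.indexOf(endMarker, start);
                if (end > start) {
                    System.out.println("Price: " + html.substring(start, end)); // prints: Price: $199.99
                }
            }
        }
    }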
My answer probably won't be useful to the writer of this question (I am 8 months late, so not the right timing I guess), but I think it will probably be useful for many other developers who might come across this answer.
Today, I just released (in the name of my company) a complete HTML-to-POJO framework that you can use to map HTML to any POJO class with just a few annotations. The library itself is quite handy and features many other things, all the while being very pluggable. You can have a look at it right here: https://github.com/whimtrip/jwht-htmltopojo
How to use: Basics
Imagine we need to parse the following HTML page:
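For the sake of the example, assume something like the following restaurant page with a list of meals (this markup is illustrative; the POJOs below refer to its class names):

    <html>
      <head>
        <title>A la bonne Franquette</title>
      </head>
      <body>
        <div class="restaurant">
          <h1>A la bonne Franquette</h1>
          <p>French cuisine restaurant for gourmets</p>
          <div class="meals">
            <div class="meal">
              <p class="name">Veal Cutlet</p>
              <p class="price">12.50</p>
            </div>
            <div class="meal">
              <p class="name">Ratatouille</p>
              <p class="price">11.00</p>
            </div>
          </div>
        </div>
      </body>
    </html>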
Let's create the POJOs we want to map it to:
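A sketch of the top-level POJO, using the library's @Selector annotation to bind CSS selectors to fields. The field names and selector values are assumptions matching the illustrative page above, and imports are omitted; see the project's README for the exact packages and a fuller example:

    // Sketch only: @Selector and List imports omitted here;
    // check the GitHub README for the exact package names.
    public class Restaurant {

        @Selector(value = "div.restaurant > h1")
        private String name;

        @Selector(value = "div.restaurant > p")
        private String description;

        // One Meal POJO is created per matching "div.meal" element.
        @Selector(value = "div.meal")
        private List<Meal> meals;

        // Getters and setters omitted for brevity.
    }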
And now the Meal class as well:
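Again a sketch, with selectors matching the illustrative page above:

    public class Meal {

        @Selector(value = "p.name")
        private String name;

        @Selector(value = "p.price")
        private Double price;

        // Getters and setters omitted for brevity.
    }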
We provided some more explanations on the above code on our GitHub page.
For the moment, let's see how to scrape this.
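Based on the project's README, the wiring looks roughly like this; treat the class and method names as something to double-check against the GitHub page, and fetch the HTML with whatever HTTP client you prefer:

    // Sketch: map raw HTML onto the annotated Restaurant POJO.
    String html = "…";                                      // the page's HTML, fetched however you like

    HtmlToPojoEngine engine = HtmlToPojoEngine.create();    // create the engine once and reuse it
    HtmlAdapter<Restaurant> adapter = engine.adapter(Restaurant.class);
    Restaurant restaurant = adapter.fromHtml(html);         // annotated fields are populated from the HTML

    System.out.println(restaurant.getName());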
Another short example can be found here
Hope this will help someone out there!
I would use JTidy - it is similar to JSoup, but I don't know JSoup well. JTidy handles broken HTML and returns a w3c Document, so you can use this as a source for XSLT to extract the content you are really interested in. If you don't know XSLT, then you might as well go with JSoup, as its document model is nicer to work with than the w3c one.
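A rough sketch of that pipeline, assuming JTidy is on the classpath and you already have an XSLT stylesheet (the URL and the extract.xsl file name are hypothetical):

    import org.w3c.dom.Document;
    import org.w3c.tidy.Tidy;

    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;
    import java.io.File;
    import java.io.InputStream;
    import java.net.URL;

    public class TidyXsltExample {

        public static void main(String[] args) throws Exception {
            // 1. Let JTidy clean up the (possibly broken) HTML and return a w3c Document.
            Tidy tidy = new Tidy();
            tidy.setQuiet(true);
            tidy.setShowWarnings(false);
            try (InputStream in = new URL("https://example.com/some-product-page").openStream()) {
                Document document = tidy.parseDOM(in, null);

                // 2. Run an XSLT stylesheet (extract.xsl is hypothetical) over that Document
                //    to pull out just the parts we care about.
                Transformer transformer = TransformerFactory.newInstance()
                        .newTransformer(new StreamSource(new File("extract.xsl")));
                transformer.transform(new DOMSource(document), new StreamResult(System.out));
            }
        }
    }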
EDIT: A quick look at the JSoup website shows that JSoup may indeed be the better choice. It seems to support CSS selectors out of the box for extracting stuff from the document. This may be a lot easier to work with than getting into XSLT.