Well, I'm pretty much trying to figure out how to pull information from a webpage, and bring it into my program (in Java).
For example, if I know the exact page I want info from, for the sake of simplicity a Best Buy item page, how would I get the appropriate info I need off of that page? Like the title, price, description?
What would this process even be called? I have no idea where to even begin researching this.
Edit: Okay, I'm running a test with jsoup (the one posted by BalusC), but I keep getting this error:
Exception in thread "main" java.lang.NoSuchMethodError: java.util.LinkedList.peekFirst()Ljava/lang/Object;
at org.jsoup.parser.TokenQueue.consumeWord(TokenQueue.java:209)
at org.jsoup.parser.Parser.parseStartTag(Parser.java:117)
at org.jsoup.parser.Parser.parse(Parser.java:76)
at org.jsoup.parser.Parser.parse(Parser.java:51)
at org.jsoup.Jsoup.parse(Jsoup.java:28)
at org.jsoup.Jsoup.parse(Jsoup.java:56)
at test.main(test.java:12)
I do have Apache Commons.
Look into the cURL library. I've never used it from Java, but I'm sure there must be bindings for it. Basically, you send a cURL request to whatever page you want to 'scrape', and the request returns a string containing the page's source code. From there, you use a regex to parse whatever data you want out of the source. That's generally how you'd do it; a sketch follows.
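Here's a minimal sketch of that fetch-then-regex approach using only the JDK (no cURL bindings needed); the URL and the `<title>` pattern are placeholders you'd swap for your actual target:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FetchAndRegex {
    public static void main(String[] args) throws Exception {
        // Fetch the raw page source as one big string
        URL url = new URL("http://www.example.com/"); // placeholder URL
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        StringBuilder source = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            source.append(line).append('\n');
        }
        in.close();

        // Pull the <title> out with a regex; this works for simple, known
        // markup, but regex is fragile against arbitrary HTML
        Matcher m = Pattern.compile("<title>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(source);
        if (m.find()) {
            System.out.println("Title: " + m.group(1).trim());
        }
    }
}
```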
You may use an HTML parser (many useful links here: java html parser).
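For instance, with jsoup (the parser mentioned in the question's edit), fetching a page and reading elements out of it takes only a few lines. The URL and the CSS selectors below are placeholders; they have to match the target page's actual markup:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExample {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page in one step
        Document doc = Jsoup.connect("http://www.example.com/some-item").get(); // placeholder URL

        // The document title is available directly
        System.out.println("Title: " + doc.title());

        // Price and description need CSS selectors matching the real page;
        // ".price" and "#description" here are made-up placeholders
        Element price = doc.select(".price").first();
        Element description = doc.select("#description").first();
        if (price != null) {
            System.out.println("Price: " + price.text());
        }
        if (description != null) {
            System.out.println("Description: " + description.text());
        }
    }
}
```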
The process is called 'grabbing website content'. Search for 'grab website content java' for further investigation.
This is referred to as screen scraping; Wikipedia has an article on the more specific web scraping. It can be a major challenge because there's some ugly, messed-up, broken-if-not-for-browser-cleverness HTML out there, so good luck.
jsoup supports Java 1.5
https://github.com/tburch/jsoup/commit/d8ea84f46e009a7f144ee414a9fa73ea187019a3
It looks like that stack trace was caused by a bug, and it has been fixed.
The JSoup solution is great, but if you just need to extract something really simple, it may be easier to use a regex or String.indexOf, as in the sketch below.
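For example, a rough String.indexOf sketch; it assumes the page source has already been fetched into a String and that you know the exact markers around the value (the markers here are made up):

```java
public class IndexOfExample {
    public static void main(String[] args) {
        // Stand-in for the fetched page source; the markers around the
        // value are invented for this sketch and must match the real markup
        String html = "<div><span id=\"price\">$19.99</span></div>";
        String startMarker = "<span id=\"price\">";

        int start = html.indexOf(startMarker);
        if (start != -1) {
            start += startMarker.length();
            int end = html.indexOf("</span>", start);
            if (end != -1) {
                System.out.println("Price: " + html.substring(start, end)); // prints $19.99
            }
        }
    }
}
```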
As others have already mentioned, the process is called scraping.
You could also try jARVEST.
It is based on a JRuby DSL over a pure-Java engine to spider-scrape-transform web sites.
Example:
Find all links inside a web page (wget and xpath are constructs of the jARVEST language), then run the same robot from inside a Java program:
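A hedged sketch of what that might look like. The Jarvest class name, the exec(robotCode, url) signature, the package in the import, and the exact DSL syntax are all assumptions drawn from the project's examples, so verify them against jARVEST's documentation:

```java
import es.uvigo.ei.sing.jarvest.core.Jarvest; // assumed package/class; check the project

public class JarvestExample {
    public static void main(String[] args) {
        Jarvest jarvest = new Jarvest();
        // Run a robot written in jARVEST's DSL against a starting URL;
        // wget fetches the page and xpath extracts every link's href
        String[] results = jarvest.exec(
            "wget | xpath('//a/@href')", // robot code (DSL syntax assumed)
            "http://www.example.com"     // URL to run the robot on
        );
        for (String link : results) {
            System.out.println(link);
        }
    }
}
```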