How to “scan” a website (or page) for info, and br

2019-01-01 12:15发布

Well, I'm pretty much trying to figure out how to pull information from a webpage, and bring it into my program (in Java).

For example, if I know the exact page I want info from, for the sake of simplicity a Best Buy item page, how would I get the appropriate info I need off of that page? Like the title, price, description?

What would this process even be called? I have no idea were to even begin researching this.

Edit: Okay, I'm running a test for the JSoup(the one posted by BalusC), but I keep getting this error:

Exception in thread "main" java.lang.NoSuchMethodError: java.util.LinkedList.peekFirst()Ljava/lang/Object;
at org.jsoup.parser.TokenQueue.consumeWord(TokenQueue.java:209)
at org.jsoup.parser.Parser.parseStartTag(Parser.java:117)
at org.jsoup.parser.Parser.parse(Parser.java:76)
at org.jsoup.parser.Parser.parse(Parser.java:51)
at org.jsoup.Jsoup.parse(Jsoup.java:28)
at org.jsoup.Jsoup.parse(Jsoup.java:56)
at test.main(test.java:12)

I do have Apache Commons

10条回答
旧人旧事旧时光
2楼-- · 2019-01-01 12:32

Look into the cURL library. I've never used it in Java, but I'm sure there must be bindings for it. Basically, what you'll do is send a cURL request to whatever page you want to 'scrape'. The request will return a string with the source code to the page. From there, you will use regex to parse whatever data you want from the source code. That's generally how you are going to do it.

查看更多
笑指拈花
3楼-- · 2019-01-01 12:35

You may use an html parser (many useful links here: java html parser).

The process is called 'grabbing website content'. Search 'grab website content java' for further invertigation.

查看更多
情到深处是孤独
4楼-- · 2019-01-01 12:36

This is referred to as screen scraping, wikipedia has this article on the more specific web scraping. It can be a major challenge because there's some ugly, mess-up, broken-if-not-for-browser-cleverness HTML out there, so good luck.

查看更多
素衣白纱
5楼-- · 2019-01-01 12:36

jsoup supports java 1.5

https://github.com/tburch/jsoup/commit/d8ea84f46e009a7f144ee414a9fa73ea187019a3

looks like that stack was a bug, and has been fixed

查看更多
还给你的自由
6楼-- · 2019-01-01 12:39

JSoup solution is great, but if you need to extract just something really simple it may be easier to use regex or String.indexOf

As others have already mentioned the process is called scraping

查看更多
若你有天会懂
7楼-- · 2019-01-01 12:44

You could also try jARVEST.

It is based on a JRuby DSL over a pure-Java engine to spider-scrape-transform web sites.

Example:

Find all links inside a web page (wget and xpath are constructs of the jARVEST's language):

wget | xpath('//a/@href')

Inside a Java program:

Jarvest jarvest = new Jarvest();
  String[] results = jarvest.exec(
    "wget | xpath('//a/@href')", //robot! 
    "http://www.google.com" //inputs
  );
  for (String s : results){
    System.out.println(s);
  }
查看更多
登录 后发表回答