How to “scan” a website (or page) for info, and br

I'm trying to figure out how to pull information from a web page and bring it into my program (in Java).

For example, if I know the exact page I want info from, for the sake of simplicity a Best Buy item page, how would I get the appropriate info I need off of that page? Like the title, price, description?

What would this process even be called? I have no idea where to even begin researching this.

Edit: Okay, I'm running a test with jsoup (the one posted by BalusC), but I keep getting this error:

Exception in thread "main" java.lang.NoSuchMethodError: java.util.LinkedList.peekFirst()Ljava/lang/Object;
at org.jsoup.parser.TokenQueue.consumeWord(TokenQueue.java:209)
at org.jsoup.parser.Parser.parseStartTag(Parser.java:117)
at org.jsoup.parser.Parser.parse(Parser.java:76)
at org.jsoup.parser.Parser.parse(Parser.java:51)
at org.jsoup.Jsoup.parse(Jsoup.java:28)
at org.jsoup.Jsoup.parse(Jsoup.java:56)
at test.main(test.java:12)

I do have Apache Commons

Tags: java html web-scraping jsoup

10 answers

旧人旧事旧时光

#2 · 2019-01-01 12:32

Look into the cURL library. I've never used it in Java, but I'm sure there must be bindings for it. Basically, you send a cURL request to whatever page you want to 'scrape'. The request returns a string containing the page's source code. From there, you use regex to pull whatever data you want out of the source. That's generally how it's done.
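The fetch-then-regex flow described above can be sketched with plain JDK classes instead of cURL bindings. This is only a rough illustration, not the answerer's code: the class name PageTitleSketch, the helper names, the UTF-8 assumption and the title pattern are all made up for the example, and a regex like this copes only with simple, predictable markup.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; all names here are illustrative only.
final class PageTitleSketch {

    // Matches the text between <title> and </title>, ignoring case and line breaks.
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Downloads the page source as one string (assumes the page is UTF-8 encoded).
    static String fetch(String url) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    // Pulls the page title out of raw HTML, if one is present.
    static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        // With a URL argument the page is downloaded; otherwise a made-up sample is used.
        String html = args.length > 0
                ? fetch(args[0])
                : "<html><head><title>Sample page</title></head></html>";
        System.out.println(extractTitle(html).orElse("(no title found)"));
    }
}
```

The same pattern-per-field idea would extend to a price or description, but each pattern depends on the exact markup of the target page and breaks when that markup changes.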


笑指拈花

#3 · 2019-01-01 12:35

You can use an HTML parser (many useful links here: java html parser).

The process is called 'grabbing website content'. Search for 'grab website content java' for further investigation.


情到深处是孤独

#4 · 2019-01-01 12:36

This is referred to as screen scraping; Wikipedia has an article on the more specific term, web scraping. It can be a major challenge because there's some ugly, messed-up, broken-if-not-for-browser-cleverness HTML out there, so good luck.


素衣白纱

#5 · 2019-01-01 12:36

jsoup supports Java 1.5.

https://github.com/tburch/jsoup/commit/d8ea84f46e009a7f144ee414a9fa73ea187019a3

It looks like that stack trace came from a bug, which has since been fixed.
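For context, the NoSuchMethodError in the question names LinkedList.peekFirst(), a method that was added to the JDK in Java 6, so the error typically appears when code that calls it runs on a Java 5 runtime. A small check along these lines (a sketch; PeekFirstCheck and hasPeekFirst are made-up names) can confirm which runtime the program is actually using:

```java
import java.util.LinkedList;

// Hypothetical diagnostic; the class and method names are illustrative only.
final class PeekFirstCheck {

    // Returns true if the running JRE's LinkedList has a public peekFirst() method.
    static boolean hasPeekFirst() {
        try {
            LinkedList.class.getMethod("peekFirst");
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Shows the runtime version and whether the method from the stack trace exists.
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("LinkedList.peekFirst() available: " + hasPeekFirst());
    }
}
```

If the second line reports false, either upgrading the runtime or using a jsoup build that contains the fix should make the error go away.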


还给你的自由

#6 · 2019-01-01 12:39

The jsoup solution is great, but if you need to extract just something really simple, it may be easier to use regex or String.indexOf.

As others have already mentioned, the process is called scraping.
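For the String.indexOf route mentioned above, a minimal sketch might look like the following. The class name BetweenMarkers, the method name and the sample markup are invented for illustration; the approach assumes the markers around the value are known in advance and appear exactly as written.

```java
// Hypothetical helper; names and the sample markup are illustrative only.
final class BetweenMarkers {

    // Returns the trimmed text between the first occurrence of start and the
    // next occurrence of end, or null if either marker is missing.
    static String between(String source, String start, String end) {
        int from = source.indexOf(start);
        if (from < 0) {
            return null;
        }
        from += start.length();
        int to = source.indexOf(end, from);
        if (to < 0) {
            return null;
        }
        return source.substring(from, to).trim();
    }

    public static void main(String[] args) {
        // Made-up snippet standing in for a downloaded page.
        String html = "<span class=\"price\"> $19.99 </span>";
        System.out.println(between(html, "<span class=\"price\">", "</span>")); // prints $19.99
    }
}
```

This stays readable for one or two fields, but it is tied to the exact marker strings, so a parser is usually the safer choice once more than a couple of values are needed.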


若你有天会懂

#7 · 2019-01-01 12:44

You could also try jARVEST.

It is based on a JRuby DSL over a pure-Java engine to spider-scrape-transform web sites.

Example:

Find all links inside a web page (wget and xpath are constructs of the jARVEST language):

wget | xpath('//a/@href')

Inside a Java program:

Jarvest jarvest = new Jarvest();
String[] results = jarvest.exec(
    "wget | xpath('//a/@href')", // the robot
    "http://www.google.com"      // the inputs
);
for (String s : results) {
    System.out.println(s);
}
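To show what the //a/@href expression selects without depending on jARVEST, here is a rough equivalent using only the JDK's built-in XML and XPath classes. This is not how jARVEST works internally: XPathLinksSketch and hrefs are made-up names, and the JDK parser used here accepts only well-formed XML, which most real-world HTML is not.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; names are illustrative only.
final class XPathLinksSketch {

    // Evaluates //a/@href against well-formed XHTML and returns the attribute values.
    static List<String> hrefs(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//a/@href", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            // Parsing fails on markup that is not well-formed XML.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up, well-formed sample document.
        String page = "<html><body><a href=\"/one\">1</a>"
                + "<p><a href=\"/two\">2</a></p></body></html>";
        for (String href : hrefs(page)) {
            System.out.println(href);
        }
    }
}
```

For pages that are not well-formed, the markup would first need to go through a lenient HTML parser before an XPath query like this could be applied.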
