Some help scraping a page in Java

I need to scrape a web page using Java. I've read that regex is a pretty inefficient way of doing it, and that one should load the page into a DOM Document and navigate that instead.

I've tried reading the documentation but it seems too extensive and I don't know where to begin.

Could you show me how to scrape this table into an array? I can try to figure out my way from there. A snippet or example would do just fine too.

Thanks.

Tags: java html xhtml screen-scraping

4 answers

老娘就宠你

#2 · 2019-02-17 11:01

You can try jsoup, a Java HTML parser. It is an excellent library with good sample code.


神经病院院长

#3 · 2019-02-17 11:08

Regex is definitely the way to go. Building a DOM is overly complicated and itself requires a lot of text parsing.
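
For a small, predictable table, the regex route could look roughly like the sketch below. The class name, the `scrape` helper and the inline sample markup are made up for illustration, and it assumes rows and cells are never nested (no table inside a table).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTableScraper {
    // Matches one <tr>...</tr> block; DOTALL lets '.' span line breaks.
    private static final Pattern ROW =
            Pattern.compile("<tr\\b[^>]*>(.*?)</tr>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Matches one <td> or <th> cell inside a row.
    private static final Pattern CELL =
            Pattern.compile("<t[dh]\\b[^>]*>(.*?)</t[dh]>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the table as rows of cell text, with any inner tags stripped.
    static String[][] scrape(String html) {
        List<String[]> rows = new ArrayList<>();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                // Drop markup inside the cell (e.g. <a> or <b>) and trim whitespace.
                cells.add(cell.group(1).replaceAll("<[^>]+>", "").trim());
            }
            rows.add(cells.toArray(new String[0]));
        }
        return rows.toArray(new String[0][]);
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, not the page from the question.
        String html = "<table>"
                + "<tr><th>Lab</th><th>Topic</th></tr>"
                + "<tr><td valign=\"top\"><a href=\"a.html\">Lab 1</a></td><td>Integers</td></tr>"
                + "</table>";
        System.out.println(Arrays.deepToString(scrape(html)));
    }
}
```

This only holds up while the markup stays as regular as the sample; attributes containing `>` or nested tables would break the patterns.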


虎瘦雄心在

#4 · 2019-02-17 11:14

If all you are doing is scraping a table into a data file, regex will be just fine, and may even be better than using a DOM document. A DOM document holds the whole page in memory (which adds up for really large data tables), so for large documents you probably want a SAX parser instead.
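
A rough sketch of the SAX approach, for comparison: the parser streams through the document and only the cells being collected are kept in memory. It assumes the input is already well-formed XHTML without nested tables (real-world HTML usually needs cleaning first). The class name, the `scrape` helper and the sample markup are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTableScraper {
    // Streams through well-formed XHTML and collects the text of every <td>/<th>, row by row.
    static List<List<String>> scrape(String xhtml) {
        final List<List<String>> rows = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private List<String> currentRow;    // cells of the row being read
            private StringBuilder currentCell;  // non-null only while inside a cell

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals("tr")) {
                    currentRow = new ArrayList<>();
                } else if (qName.equals("td") || qName.equals("th")) {
                    currentCell = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per cell, so append rather than assign.
                if (currentCell != null) {
                    currentCell.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ((qName.equals("td") || qName.equals("th")) && currentRow != null) {
                    currentRow.add(currentCell.toString().trim());
                    currentCell = null;
                } else if (qName.equals("tr") && currentRow != null) {
                    rows.add(currentRow);
                    currentRow = null;
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            // Wrapped so callers need no checked-exception handling in this sketch.
            throw new IllegalStateException("Could not parse XHTML", e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, not the page from the question.
        String xhtml = "<table>"
                + "<tr><td valign=\"top\"><a href=\"a.html\">Lab 1</a></td><td>Integers</td></tr>"
                + "<tr><td valign=\"top\"><a href=\"b.html\">Lab 2</a></td><td>Images</td></tr>"
                + "</table>";
        System.out.println(scrape(xhtml));
    }
}
```

For a real page, each finished row could be written straight to the data file inside `endElement` instead of being added to `rows`, so nothing accumulates in memory.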


Fickle 薄情

#5 · 2019-02-17 11:16

1. Transform the web page you are trying to scrape into an XHTML document. There are several options to do this with Java, such as JTidy and HTMLCleaner. These tools will also automatically fix malformed HTML (e.g., close unclosed tags). Both work very well, but I prefer JTidy because it integrates better with Java's DOM API.
2. Extract the required information using XPath expressions.

Here is a working example using JTidy and the web page you provided. It extracts all file names from the table.

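import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class TableScraper {
    public static void main(String[] args) throws Exception {
        // Create a new JTidy instance and ask it to produce XHTML
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);

        // Parse the HTML page into a DOM document (the tidied markup is also written to System.out)
        URL url = new URL("http://www.cs.grinnell.edu/~walker/fluency-book/labs/sample-table.html");
        Document doc = tidy.parseDOM(url.openStream(), System.out);

        // Use XPath to select the link text inside every top-aligned table cell
        XPath xpath = XPathFactory.newInstance().newXPath();
        XPathExpression expr = xpath.compile("//td[@valign = 'top']/a/text()");
        NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);

        // Copy the matched text nodes into a list of strings
        List<String> filenames = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            filenames.add(nodes.item(i).getNodeValue());
        }

        System.out.println(filenames);
    }
}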

The result will be [Integer Processing:, Image Processing:, A Photo Album:, Run-time Experiments:, More Run-time Experiments:] as expected.
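
Since the question asked for the table as an array, here is a rough sketch of the same XPath idea applied row by row. To stay self-contained it parses a small inline XHTML string with the JDK's own `DocumentBuilder` rather than fetching the page through JTidy. The class name, the `parse` and `toArray` helpers and the sample markup are made up for illustration. In principle `toArray` should accept any `org.w3c.dom.Document`, but treat its use with JTidy's output as an untested assumption.

```java
import java.io.StringReader;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathTableScraper {
    // Parses a well-formed XHTML string into a DOM document using the JDK parser.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    // Turns every <tr> of the document into one String[] of trimmed cell text.
    static String[][] toArray(Document doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList rows = (NodeList) xpath.evaluate("//tr", doc, XPathConstants.NODESET);
            String[][] table = new String[rows.getLength()][];
            for (int r = 0; r < rows.getLength(); r++) {
                // A relative expression picks up only the cells of the current row.
                NodeList cells = (NodeList) xpath.evaluate("td | th", rows.item(r), XPathConstants.NODESET);
                table[r] = new String[cells.getLength()];
                for (int c = 0; c < cells.getLength(); c++) {
                    // "." yields the cell's string value: all descendant text concatenated.
                    table[r][c] = xpath.evaluate(".", cells.item(c)).trim();
                }
            }
            return table;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, not the page from the question.
        String xhtml = "<html><body><table>"
                + "<tr><td valign=\"top\"><a href=\"a.html\">Lab 1</a></td><td>Integers</td></tr>"
                + "<tr><td valign=\"top\"><a href=\"b.html\">Lab 2</a></td><td>Images</td></tr>"
                + "</table></body></html>";
        System.out.println(Arrays.deepToString(toArray(parse(xhtml))));
    }
}
```

Rows may have different lengths (header rows, merged cells), which is why each row gets its own `String[]` rather than a fixed-width grid.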

Another cool tool that you can use is Web Harvest. It basically does everything I did above, but uses an XML file to configure the extraction pipeline.
