Java Web Crawler Libraries

Posted 2019-02-01 05:40

Question:

I want to make a Java-based web crawler for an experiment. I heard that making a web crawler in Java is the way to go if this is your first time. However, I have two important questions.

  1. How will my program 'visit' or 'connect' to web pages? Please give a brief explanation. (I understand the basics of the layers of abstraction from the hardware up to the software, here I am interested in the Java abstractions)

  2. What libraries should I use? I would assume I need a library for connecting to web pages, a library for HTTP/HTTPS protocol, and a library for HTML parsing.

Answer 1:

This is how your program 'visits' or 'connects' to web pages:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;

    URL url;
    BufferedReader reader = null;
    String line;

    try {
        url = new URL("http://stackoverflow.com/");
        // openStream() opens a connection and returns its InputStream (throws IOException);
        // wrap it in a BufferedReader (DataInputStream.readLine() is deprecated)
        reader = new BufferedReader(new InputStreamReader(url.openStream()));

        // read the response line by line and print it
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    } catch (MalformedURLException mue) {
        mue.printStackTrace();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    } finally {
        if (reader != null) {
            try {
                reader.close();
            } catch (IOException ioe) {
                // ignore failures on close
            }
        }
    }

This will download the HTML source of the page.

For HTML parsing, see this.

Also take a look at jSpider and jsoup.



Answer 2:

Crawler4j is the best solution for you.

Crawler4j is an open-source Java crawler which provides a simple interface for crawling the Web. You can set up a multi-threaded web crawler in 5 minutes!

Also visit this page for more Java-based web crawler tools and a brief explanation of each.
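
As a rough sketch of what this looks like, here is a minimal crawler4j setup (assuming the crawler4j 4.x API; the seed URL, storage folder, and thread count are arbitrary values chosen for illustration):

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class MyCrawler extends WebCrawler {

        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            // only follow links that stay on the seed site
            return url.getURL().startsWith("http://stackoverflow.com/");
        }

        @Override
        public void visit(Page page) {
            // called once per fetched page; print its URL and title
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData html = (HtmlParseData) page.getParseData();
                System.out.println(page.getWebURL().getURL() + " -> " + html.getTitle());
            }
        }

        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawl");   // intermediate storage, arbitrary path
            PageFetcher fetcher = new PageFetcher(config);
            RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
            CrawlController controller = new CrawlController(config, fetcher, robots);
            controller.addSeed("http://stackoverflow.com/");
            controller.start(MyCrawler.class, 4);         // 4 crawler threads
        }
    }

You extend WebCrawler, decide which URLs to follow in shouldVisit, and handle each fetched page in visit; the CrawlController then drives the multi-threaded crawl.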



Answer 3:

For parsing content, I'm using Apache Tika.
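
A minimal sketch of how that looks with the Tika facade class (the URL is just an example; Tika auto-detects the content type and returns the extracted plain text):

    import java.net.URL;
    import org.apache.tika.Tika;

    public class TikaExample {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            // detect the content type and extract the text content of the resource
            String text = tika.parseToString(new URL("http://stackoverflow.com/"));
            System.out.println(text);
        }
    }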



Answer 4:

There are now many Java-based HTML parsers that can fetch and parse HTML pages.

  • Jsoup
  • Jaunt API
  • HtmlCleaner
  • JTidy
  • NekoHTML
  • TagSoup

Here's the complete list of HTML parsers with a basic comparison.
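
For instance, here is a minimal sketch with jsoup, the first parser on the list (the URL and the CSS selector are arbitrary examples):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class JsoupExample {
        public static void main(String[] args) throws Exception {
            // fetch and parse the page in one step
            Document doc = Jsoup.connect("http://stackoverflow.com/").get();
            System.out.println("Title: " + doc.title());

            // select every anchor tag that has an href attribute
            for (Element link : doc.select("a[href]")) {
                System.out.println(link.attr("abs:href") + " : " + link.text());
            }
        }
    }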



Answer 5:

I recommend using the HttpClient library. You can find examples here.
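
A minimal GET request, assuming Apache HttpClient 4.x (the URL is just an example):

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class HttpClientExample {
        public static void main(String[] args) throws Exception {
            // the default client manages connections and follows redirects
            try (CloseableHttpClient client = HttpClients.createDefault()) {
                HttpGet request = new HttpGet("http://stackoverflow.com/");
                try (CloseableHttpResponse response = client.execute(request)) {
                    // read the response body into a String
                    String body = EntityUtils.toString(response.getEntity());
                    System.out.println(body);
                }
            }
        }
    }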



Answer 6:

I would prefer crawler4j. Crawler4j is an open-source Java crawler which provides a simple interface for crawling the Web. You can set up a multi-threaded web crawler in a few hours.



Answer 7:

Have a look at these existing projects if you want to learn how it can be done:

  • Apache Nutch
  • crawler4j
  • gecco
  • Norconex HTTP Collector
  • vidageek crawler
  • webmagic
  • Webmuncher

A typical crawler process is a loop consisting of fetching, parsing, link extraction, and processing of the output (storing, indexing). The devil is in the details, though: how to be "polite" and respect robots.txt, meta tags, redirects, rate limits, URL canonicalization, infinite depth, retries, revisits, etc.

Flow diagram courtesy of Norconex HTTP Collector.
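
A bare-bones sketch of that fetch-parse-extract loop (using jsoup for fetching and link extraction; politeness, robots.txt, rate limiting, and canonicalization are deliberately left out, and the seed URL and page limit are arbitrary):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class TinyCrawler {
        public static void main(String[] args) {
            Deque<String> frontier = new ArrayDeque<>();   // URLs still to fetch
            Set<String> visited = new HashSet<>();         // URLs already fetched
            frontier.add("http://stackoverflow.com/");     // seed

            while (!frontier.isEmpty() && visited.size() < 100) {   // crude stop condition
                String url = frontier.poll();
                if (!visited.add(url)) {
                    continue;   // already seen
                }
                try {
                    // fetch and parse
                    Document doc = Jsoup.connect(url).get();
                    System.out.println(url + " -> " + doc.title());   // "process" the page

                    // extract links and add them to the frontier
                    for (Element link : doc.select("a[href]")) {
                        String next = link.attr("abs:href");
                        if (next.startsWith("http") && !visited.contains(next)) {
                            frontier.add(next);
                        }
                    }
                } catch (Exception e) {
                    // a real crawler would retry or log; here the URL is simply skipped
                }
            }
        }
    }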



Answer 8:

You can explore Apache Droids or Apache Nutch to get a feel for a Java-based crawler.



Answer 9:

Though mainly used for unit testing web applications, HttpUnit traverses a website, clicks links, analyzes tables and form elements, and gives you metadata about all the pages. I use it for web crawling, not just for unit testing. - http://httpunit.sourceforge.net/
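
A minimal sketch of that kind of traversal, assuming HttpUnit's WebConversation/WebResponse API (the URL is just an example):

    import com.meterware.httpunit.WebConversation;
    import com.meterware.httpunit.WebLink;
    import com.meterware.httpunit.WebResponse;

    public class HttpUnitExample {
        public static void main(String[] args) throws Exception {
            WebConversation wc = new WebConversation();
            // fetch the page; HttpUnit parses it so links, tables and forms can be queried
            WebResponse response = wc.getResponse("http://httpunit.sourceforge.net/");
            System.out.println("Title: " + response.getTitle());

            // enumerate the links on the page
            for (WebLink link : response.getLinks()) {
                System.out.println(link.getURLString());
            }
        }
    }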



Answer 10:

I think jsoup is better than the others. jsoup runs on Java 1.5 and up, Scala, Android, OSGi, and Google App Engine.



Answer 11:

Here is a list of available crawlers:

https://java-source.net/open-source/crawlers

But I suggest using Apache Nutch.