Java Web Crawler Libraries

Posted 2019-02-01 05:40

Question:

I want to make a Java-based web crawler for an experiment. I heard that making a web crawler in Java is the way to go if this is your first time. However, I have two important questions.

  1. How will my program 'visit' or 'connect' to web pages? Please give a brief explanation. (I understand the basics of the layers of abstraction from the hardware up to the software, here I am interested in the Java abstractions)

  2. What libraries should I use? I would assume I need a library for connecting to web pages, a library for HTTP/HTTPS protocol, and a library for HTML parsing.

Answer 1:

This is how your program 'visits' or 'connects' to web pages:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;

    URL url;
    BufferedReader reader = null;
    String line;

    try {
        url = new URL("http://stackoverflow.com/");
        // openStream() opens a connection and returns its InputStream (throws IOException);
        // wrap it in a BufferedReader (DataInputStream.readLine() is deprecated)
        reader = new BufferedReader(new InputStreamReader(url.openStream()));

        // read the response line by line and print it
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    } catch (MalformedURLException mue) {
        mue.printStackTrace();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    } finally {
        if (reader != null) {
            try {
                reader.close();
            } catch (IOException ioe) {
                // ignore failures on close
            }
        }
    }

This will download the HTML source of the page.

For HTML parsing, see this.

Also take a look at jSpider and jsoup.



Answer 2:

Crawler4j is the best solution for you.

Crawler4j is an open-source Java crawler which provides a simple interface for crawling the Web. You can set up a multi-threaded web crawler in 5 minutes!

Also visit this page for more Java-based web crawler tools and a brief explanation of each.
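
As a rough sketch of what this looks like, here is a minimal crawler4j setup (assuming the crawler4j 4.x API; the seed URL, storage folder, and thread count are arbitrary values chosen for illustration):

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class MyCrawler extends WebCrawler {

        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            // only follow links that stay on the seed site
            return url.getURL().startsWith("http://stackoverflow.com/");
        }

        @Override
        public void visit(Page page) {
            // called once per fetched page; print its URL and title
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData html = (HtmlParseData) page.getParseData();
                System.out.println(page.getWebURL().getURL() + " -> " + html.getTitle());
            }
        }

        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawl");   // intermediate storage, arbitrary path
            PageFetcher fetcher = new PageFetcher(config);
            RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
            CrawlController controller = new CrawlController(config, fetcher, robots);
            controller.addSeed("http://stackoverflow.com/");
            controller.start(MyCrawler.class, 4);         // 4 crawler threads
        }
    }

You extend WebCrawler, decide which URLs to follow in shouldVisit, and handle each fetched page in visit; the CrawlController then drives the multi-threaded crawl.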



Answer 3:

For parsing content, I'm using Apache Tika.
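
A minimal sketch of how that looks with the Tika facade class (the URL is just an example; Tika auto-detects the content type and returns the extracted plain text):

    import java.net.URL;
    import org.apache.tika.Tika;

    public class TikaExample {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            // detect the content type and extract the text content of the resource
            String text = tika.parseToString(new URL("http://stackoverflow.com/"));
            System.out.println(text);
        }
    }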



Answer 4:

There are now many Java-based HTML parsers that can fetch and parse HTML pages.

  • Jsoup
  • Jaunt API
  • HtmlCleaner
  • JTidy
  • NekoHTML
  • TagSoup

Here's the complete list of HTML parsers with a basic comparison.
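
For instance, here is a minimal sketch with jsoup, the first parser on the list (the URL and the CSS selector are arbitrary examples):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class JsoupExample {
        public static void main(String[] args) throws Exception {
            // fetch and parse the page in one step
            Document doc = Jsoup.connect("http://stackoverflow.com/").get();
            System.out.println("Title: " + doc.title());

            // select every anchor tag that has an href attribute
            for (Element link : doc.select("a[href]")) {
                System.out.println(link.attr("abs:href") + " : " + link.text());
            }
        }
    }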



Answer 5:

I recommend using the HttpClient library. You can find examples here.
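
A minimal GET request, assuming Apache HttpClient 4.x (the URL is just an example):

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class HttpClientExample {
        public static void main(String[] args) throws Exception {
            // the default client manages connections and follows redirects
            try (CloseableHttpClient client = HttpClients.createDefault()) {
                HttpGet request = new HttpGet("http://stackoverflow.com/");
                try (CloseableHttpResponse response = client.execute(request)) {
                    // read the response body into a String
                    String body = EntityUtils.toString(response.getEntity());
                    System.out.println(body);
                }
            }
        }
    }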



Answer 6:

I would prefer crawler4j. Crawler4j is an open-source Java crawler which provides a simple interface for crawling the Web. You can set up a multi-threaded web crawler in a few hours.



Answer 7:

Have a look at these existing projects if you want to learn how it can be done:

  • Apache Nutch
  • crawler4j
  • gecco
  • Norconex HTTP Collector
  • vidageek crawler
  • webmagic
  • Webmuncher

A typical crawler process is a loop consisting of fetching, parsing, link extraction, and processing of the output (storing, indexing). The devil is in the details, though: how to be "polite" and respect robots.txt, meta tags, redirects, rate limits, URL canonicalization, infinite depth, retries, revisits, etc.

Flow diagram courtesy of Norconex HTTP Collector.
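
A bare-bones sketch of that fetch-parse-extract loop (using jsoup for fetching and link extraction; politeness, robots.txt, rate limiting, and canonicalization are deliberately left out, and the seed URL and page limit are arbitrary):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class TinyCrawler {
        public static void main(String[] args) {
            Deque<String> frontier = new ArrayDeque<>();   // URLs still to fetch
            Set<String> visited = new HashSet<>();         // URLs already fetched
            frontier.add("http://stackoverflow.com/");     // seed

            while (!frontier.isEmpty() && visited.size() < 100) {   // crude stop condition
                String url = frontier.poll();
                if (!visited.add(url)) {
                    continue;   // already seen
                }
                try {
                    // fetch and parse
                    Document doc = Jsoup.connect(url).get();
                    System.out.println(url + " -> " + doc.title());   // "process" the page

                    // extract links and add them to the frontier
                    for (Element link : doc.select("a[href]")) {
                        String next = link.attr("abs:href");
                        if (next.startsWith("http") && !visited.contains(next)) {
                            frontier.add(next);
                        }
                    }
                } catch (Exception e) {
                    // a real crawler would retry or log; here the URL is simply skipped
                }
            }
        }
    }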



Answer 8:

You can explore Apache Droids or Apache Nutch to get a feel for a Java-based crawler.



Answer 9:

Though mainly used for unit testing web applications, HttpUnit traverses a website, clicks links, analyzes tables and form elements, and gives you metadata about all the pages. I use it for web crawling, not just for unit testing. - http://httpunit.sourceforge.net/
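
A minimal sketch of that kind of traversal, assuming HttpUnit's WebConversation/WebResponse API (the URL is just an example):

    import com.meterware.httpunit.WebConversation;
    import com.meterware.httpunit.WebLink;
    import com.meterware.httpunit.WebResponse;

    public class HttpUnitExample {
        public static void main(String[] args) throws Exception {
            WebConversation wc = new WebConversation();
            // fetch the page; HttpUnit parses it so links, tables and forms can be queried
            WebResponse response = wc.getResponse("http://httpunit.sourceforge.net/");
            System.out.println("Title: " + response.getTitle());

            // enumerate the links on the page
            for (WebLink link : response.getLinks()) {
                System.out.println(link.getURLString());
            }
        }
    }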



Answer 10:

I think jsoup is better than the others. jsoup runs on Java 1.5 and up, Scala, Android, OSGi, and Google App Engine.



Answer 11:

Here is a list of available crawlers:

https://java-source.net/open-source/crawlers

But I suggest using Apache Nutch.