I want to make a Java-based web crawler for an experiment. I heard that writing a web crawler in Java is the way to go if this is your first time. However, I have two important questions.
How will my program 'visit' or 'connect' to web pages? Please give a brief explanation. (I understand the basics of the layers of abstraction from the hardware up to the software; here I am interested in the Java abstractions.)
What libraries should I use? I would assume I need a library for connecting to web pages, a library for HTTP/HTTPS protocol, and a library for HTML parsing.
You can explore Apache Droids or Apache Nutch to get a feel for Java-based crawlers.
I recommend using the HttpClient library. You can find examples here.
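For instance, a simple GET request looks roughly like this (a minimal sketch assuming Apache HttpClient 4.x; the URL is just a placeholder):

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpClientFetch {
    public static void main(String[] args) throws Exception {
        // One client instance can be reused for many requests.
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpGet request = new HttpGet("https://example.com/");
            try (CloseableHttpResponse response = client.execute(request)) {
                // Read the response body as a String (the raw HTML).
                String html = EntityUtils.toString(response.getEntity());
                System.out.println(html);
            }
        }
    }
}
```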
For parsing content, I'm using Apache Tika.
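A rough sketch of what that can look like with the Tika facade class (the URL and class name are only illustrative):

```java
import java.io.InputStream;
import java.net.URL;
import org.apache.tika.Tika;

public class TikaExtract {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // The facade detects the content type and extracts plain text
        // from HTML, PDF, Word documents, and many other formats.
        try (InputStream in = new URL("https://example.com/").openStream()) {
            String text = tika.parseToString(in);
            System.out.println(text);
        }
    }
}
```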
I think jsoup is better than the others; it runs on Java 1.5 and up, Scala, Android, OSGi, and Google App Engine.
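For example, fetching a page and listing its links takes only a few lines (a minimal sketch; the URL is a placeholder):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupLinks {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page in one call.
        Document doc = Jsoup.connect("https://example.com/").get();
        System.out.println("Title: " + doc.title());
        // Select every anchor with an href attribute and print its absolute URL.
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href"));
        }
    }
}
```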
Here is a list of available crawlers:
https://java-source.net/open-source/crawlers
But I suggest using Apache Nutch.
This is how your program 'visits' or 'connects' to web pages:
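Something along these lines with plain java.net, for example (a minimal sketch; the URL is a placeholder):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class FetchPage {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://example.com/");
        URLConnection connection = url.openConnection();
        // Read the response stream line by line and print the raw HTML.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```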
This will download the source of the HTML page.
For HTML parsing, see this.
Also take a look at jSpider and jsoup.