I am looking for an Apache Lucene web crawler, written in Java if possible, or in any other language. The crawler must use Lucene and create a valid Lucene index and document files, which is the reason why Nutch, for example, is ruled out...
Does anybody know whether such a web crawler exists, and if the answer is yes, where can I find it? Thanks...
Take a look at the Solr search server and Nutch (a crawler); both are related to the Lucene project.
What you're asking for is two components: a web crawler and a Lucene-based automated indexer.
First, a word of encouragement: been there, done that. I'll tackle both of the components individually from the point of view of making your own, since I don't believe you could use Lucene to do what you've requested without really understanding what's going on underneath.
Web crawler
So you have a web site/directory you want to "crawl" through to collect specific resources. Assuming it's a common web server which lists directory contents, making a web crawler is easy: just point it to the root of the directory and define rules for collecting the actual files, such as "ends with .txt". Very simple stuff, really.
The actual implementation could be something like this: use HttpClient to get the actual web pages/directory listings, and parse them in the way you find most efficient, such as using XPath to select all the links from the fetched document, or just parsing it with regex using Java's readily available Pattern and Matcher classes. If you decide to go the XPath route, consider using JDOM for DOM handling and Jaxen for the actual XPath.
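Very roughly, that fetch-and-parse step could look like the sketch below, assuming Apache HttpClient 4.x and the regex route; the class name and the ".txt" rule are just placeholders for whatever your collect rule actually is:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class LinkExtractor {

        // Naive href extraction; good enough for plain directory listings.
        private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

        public List<String> extractTextFileLinks(String url) throws Exception {
            try (CloseableHttpClient client = HttpClients.createDefault();
                 CloseableHttpResponse response = client.execute(new HttpGet(url))) {
                String html = EntityUtils.toString(response.getEntity());
                List<String> links = new ArrayList<>();
                Matcher m = HREF.matcher(html);
                while (m.find()) {
                    String link = m.group(1);
                    if (link.endsWith(".txt")) { // the "collect rule" from above
                        links.add(link);
                    }
                }
                return links;
            }
        }
    }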
Once you get the actual resources you want, such as a bunch of text files, you need to identify the type of data to be able to know what to index and what you can safely ignore. For simplicity's sake I'm assuming these are plaintext files with no fields or anything, so I won't go deeper into that, but if you have multiple fields to store, I suggest you make your crawler produce 1..n specialized beans with accessors and mutators (bonus points: make the bean immutable, don't allow accessors to mutate the internal state of the bean, and create a copy constructor for the bean) to be used in the other component.
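For illustration, such a bean could be as simple as the following sketch; the field names (title, contents) are made up and should be whatever your data actually contains:

    // An immutable value object the crawler produces and the indexer consumes.
    public final class YourBean {

        private final String title;
        private final String contents;

        public YourBean(String title, String contents) {
            this.title = title;
            this.contents = contents;
        }

        // Copy constructor, as suggested above.
        public YourBean(YourBean other) {
            this(other.title, other.contents);
        }

        public String getTitle() {
            return title;
        }

        public String getContents() {
            return contents;
        }
    }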
In terms of API calls, you should have something like HttpCrawler#getDocuments(String url), which returns a List<YourBean> to use in conjunction with the actual indexer.
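Tying the earlier sketches together, that crawler contract could look roughly like this; again, the names are only illustrative:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class HttpCrawler {

        public List<YourBean> getDocuments(String url) throws Exception {
            List<YourBean> beans = new ArrayList<>();
            try (CloseableHttpClient client = HttpClients.createDefault()) {
                // Walk the links collected by the extractor sketched earlier and
                // wrap every fetched resource in a bean for the indexer.
                for (String link : new LinkExtractor().extractTextFileLinks(url)) {
                    String contents = EntityUtils.toString(
                            client.execute(new HttpGet(link)).getEntity());
                    beans.add(new YourBean(link, contents));
                }
            }
            return beans;
        }
    }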
Lucene-based automated indexer
Beyond the obvious stuff with Lucene, such as setting up a directory and understanding its threading model (only one write operation is allowed at any time, while multiple reads can exist even when the index is being updated), you of course want to feed your beans to the index. The five minute tutorial I already linked to basically does exactly that: look into the example addDoc(..) method and just replace the String with YourBean.
Note that Lucene's IndexWriter does have some cleanup methods which are handy to execute in a controlled manner. For example, calling IndexWriter#commit() only after a bunch of documents have been added to the index is good for performance, and then calling IndexWriter#optimize() to make sure the index isn't getting hugely bloated over time is a good idea too. Always remember to close the index as well to avoid unnecessary LockObtainFailedExceptions being thrown; as with all IO in Java, such an operation should of course be done in the finally block.
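Put into code, that indexing loop could look something like the sketch below, assuming a Lucene 3.x-style API (where IndexWriter#optimize() still exists) and the hypothetical YourBean from earlier; field names are again just examples:

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class BeanIndexer {

        public void index(Iterable<YourBean> beans, File indexDir) throws Exception {
            IndexWriterConfig config = new IndexWriterConfig(
                    Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
            IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), config);
            try {
                for (YourBean bean : beans) {
                    Document doc = new Document();
                    doc.add(new Field("title", bean.getTitle(),
                            Field.Store.YES, Field.Index.ANALYZED));
                    doc.add(new Field("contents", bean.getContents(),
                            Field.Store.YES, Field.Index.ANALYZED));
                    writer.addDocument(doc);
                }
                writer.commit();   // commit once after the batch, not per document
                writer.optimize(); // occasional optimize keeps the index from bloating
            } finally {
                writer.close();    // always close to avoid LockObtainFailedException
            }
        }
    }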
Caveats
A range query such as [0 TO 5] actually gets transformed into +0 +1 +2 +3 +4 +5, which means the range query dies out very quickly, because there's a maximum number of query sub parts.
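If you do hit that ceiling, Lucene throws BooleanQuery.TooManyClauses; the limit can be raised, although that is more of a band-aid than a fix (again assuming a pre-4.x style API, and the 4096 below is an arbitrary example value):

    // The default maximum is 1024 sub clauses; exceeding it throws
    // BooleanQuery.TooManyClauses.
    org.apache.lucene.search.BooleanQuery.setMaxClauseCount(4096);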
With this information I do believe you could make your own special Lucene indexer in less than a day, or three if you want to test it rigorously.