So I'm following the Storm-Crawler-ElasticSearch tutorial and playing around with it.
When searching in Kibana, I've noticed that the number of hits for the index named 'status' is far greater than for 'index'.
Example:
On the top left, you can see that there are 846 hits for the 'status' index. I assume that means it has crawled through 846 pages.
With the 'index' index, only 31 hits are shown.
I understand that functionally 'index' and 'status' are different, as 'status' is only responsible for the link metadata. The problem is that it seems StormCrawler is parsing many pages and not indexing them.
So what I would like to have is the same number of hits on 'index' too, with the content displayed, instead of just 31.
The 'status' index contains the information about all the URLs the crawler either fetched or discovered. This is roughly the equivalent of the crawldb in Nutch. The 'index' index contains the pages that have been fetched, parsed and, well, indexed.
Now if you look at the 'status' field within the status index, you'll find that there are different values indicating whether a URL has been DISCOVERED, FETCHED, etc. See the wiki page about the status stream.
The ones marked as DISCOVERED haven't yet been fetched and therefore can't be in the 'index' index. If you filter the content of the status index by status:FETCHED you should see a number comparable to the target index.
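If you'd rather check this outside Kibana, the comparison can be sketched as two calls to Elasticsearch's _count API. This is only a rough sketch: the endpoint http://localhost:9200 is an assumption about your setup, the helper names are made up for illustration, and the term query assumes the 'status' field is mapped as a keyword.

```python
import json
import urllib.request

# Assumption: Elasticsearch is reachable on the default local port.
ES_URL = "http://localhost:9200"


def build_count_request(index, status=None):
    """Return (url, body) for an Elasticsearch _count call (hypothetical helper).

    With status=None every document in the index is counted; otherwise only
    documents whose 'status' field equals the given value.
    """
    url = f"{ES_URL}/{index}/_count"
    if status is None:
        query = {"match_all": {}}
    else:
        # Assumes 'status' is a keyword field, so an exact term match works.
        query = {"term": {"status": status}}
    return url, {"query": query}


def count(index, status=None):
    """Send the request and return the 'count' value from the response."""
    url, body = build_count_request(index, status)
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["count"]
```

Calling count("status", "FETCHED") and count("index") against a running cluster should then give two numbers in the same ballpark.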
The Elasticsearch module in SC contains templates for Kibana that allow you to see the breakdown of URLs per status. If you haven't done so already, I'd recommend that you look at the video tutorials on YouTube.
So what I would like to have is the same number of hits on 'index' too, with the content displayed, instead of just 31.
It will eventually get there; you just need to give the crawler time to do its job (and do so politely). Bear in mind that a crawler discovers URLs quicker than it fetches them. Before you ask about speed, please read the FAQ.
Redirections and fetch errors are another possible reason for the difference. They exist in the 'status' index but not in the 'index' index.
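To see how the gap between the two indices breaks down, one option is a terms aggregation on the 'status' field. The sketch below only builds the request body and reads the response; the aggregation name by_status and both helper functions are made up for illustration, and it again assumes 'status' is a keyword field. The body would be sent with POST to /status/_search.

```python
# Request body for POST /status/_search: no hits, just one bucket per status.
STATUS_BREAKDOWN_BODY = {
    "size": 0,
    "aggs": {"by_status": {"terms": {"field": "status"}}},
}


def breakdown(response):
    """Turn the aggregation part of a _search response into {status: count}."""
    buckets = response["aggregations"]["by_status"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


def not_fetched(counts):
    """Number of URLs in 'status' that are not marked FETCHED.

    These (DISCOVERED, redirections, fetch errors, ...) cannot have a
    matching document in the 'index' index.
    """
    return sum(n for status, n in counts.items() if status != "FETCHED")
```

The total in 'status' minus not_fetched(...) is the figure to compare with the hit count of 'index'.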