I have crawled a website using Nutch with HBase as the storage back-end, following this tutorial: http://wiki.apache.org/nutch/Nutch2Tutorial
The versions are Nutch 2.2.1, HBase 0.90.4, and Solr 4.7.1.
Here are the steps I used:
./runtime/local/bin/nutch inject urls
./runtime/local/bin/nutch generate -topN 100 -adddays 30
./runtime/local/bin/nutch fetch -all
./runtime/local/bin/nutch updatedb
./runtime/local/bin/nutch solrindex http://localhost:8983/solr/ -all
My urls/seed.txt file contains:
http://www.xyzshoppingsite.com/mobiles/
And I have kept only the line below in 'regex-urlfilter.txt' (all other regexes are commented out):
+^http://([a-z0-9]*\.)*xyzshoppingsite.com/mobile/*
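As a quick sanity check outside Nutch, you can try the filter pattern against the seed URL with `grep -E`, which (like Nutch's regex-urlfilter) does unanchored find-style matching. This is only an approximation of the Java regex engine, but note that the `/mobile/*` prefix also accepts `/mobiles/`, so the seed should pass the filter:

```shell
# Approximate the regex-urlfilter check with grep -E; the URL is printed
# only if the pattern matches, i.e. if the filter would accept it.
echo 'http://www.xyzshoppingsite.com/mobiles/' \
  | grep -E '^http://([a-z0-9]*\.)*xyzshoppingsite.com/mobile/*'
# prints: http://www.xyzshoppingsite.com/mobiles/
```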
At the end of the crawl, I can see a table "webpage" created in HBase, but I am unable to verify whether all of the data has been crawled completely. When I search in Solr, it shows nothing: 0 results.
My ultimate intention is to get the complete data from all pages under mobiles in the above URL.
Could you please let me know:
1. How can I verify the crawled data present in HBase?
2. The Solr log directory contains 0 files, so I am unable to get a breakthrough there. How do I resolve this?
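On the first question, one way to inspect what the crawl stored is the HBase shell; this is a sketch using standard shell commands against the "webpage" table (the `get` row key is hypothetical — Nutch 2.x stores rows under a reversed-host key of roughly this shape):

```
hbase shell
count 'webpage'               # number of rows, i.e. URLs known to the crawl
scan 'webpage', {LIMIT => 2}  # peek at the first couple of rows
get 'webpage', 'com.xyzshoppingsite.www:http/mobiles/'  # hypothetical row key
```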
The output of the HBase command
scan "webpage"
shows only timestamps, and the remaining data as:
value=\x0A\x0APlease Wait ... Redirecting to <a href="/mobiles"><b>http://www.xyzshoppingsite.com/mobiles</b></a>Please Wait ... Redirecting to <a href="/mobiles"><b>http://www.xyzshoppingsite.com/mobiles</b></a>
Why is the data crawled like this, and not the actual contents of the page after redirection?
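A hedged guess about the redirect question: the stored value is the redirect placeholder page, and by default Nutch does not follow redirects within the same fetch (`http.redirect.max` defaults to 0, so the redirect target is only queued for a later round). If that is the cause here, raising the limit in conf/nutch-site.xml may help:

```xml
<property>
  <name>http.redirect.max</name>
  <!-- follow up to 3 redirects immediately instead of deferring them -->
  <value>3</value>
  <description>The maximum number of redirects the fetcher will follow
  when trying to fetch a page.</description>
</property>
```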
Please help. Thanks in advance!
Instead of executing all those steps, you can use the command below.
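The command itself is missing from the answer; a sketch of what is presumably meant, the one-step crawl script shipped with Nutch 2.x (`bin/crawl <seedDir> <crawlId> <solrUrl> <numberOfRounds>`), using `shoppingcrawl` as the crawl id so that the table name below matches. The seed directory and Solr URL are taken from the question; the round count is an assumption:

```
./runtime/local/bin/crawl urls shoppingcrawl http://localhost:8983/solr/ 2
```

Each round of this script runs inject/generate/fetch/parse/updatedb and the Solr indexing job, which also covers the parse step that is absent from the manual sequence in the question — without parsing, solrindex has nothing to index.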
If it executes successfully, a table named shoppingcrawl_webpage will be created in HBase.
We can check by executing the command below in the HBase shell.
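The command referred to here is presumably the HBase shell's `list`, which prints the names of all tables:

```
hbase shell
list
```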
Then we can scan the specific table, in this case:
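A sketch of that scan (the `LIMIT` clause is optional and just keeps the output short):

```
scan 'shoppingcrawl_webpage', {LIMIT => 10}
```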