Unable to verify crawled data stored in HBase

I crawled a website using Nutch with HBase as the storage back-end, following this tutorial: http://wiki.apache.org/nutch/Nutch2Tutorial.

Nutch version is 2.2.1, HBase version is 0.90.4, and Solr version is 4.7.1.

Here are the steps I ran:

./runtime/local/bin/nutch inject urls

./runtime/local/bin/nutch generate -topN 100 -adddays 30

./runtime/local/bin/nutch fetch -all

./runtime/local/bin/nutch updatedb

./runtime/local/bin/nutch solrindex http://localhost:8983/solr/ -all

My url/seed.txt file contains: http://www.xyzshoppingsite.com/mobiles/

I have kept only the line below in the 'regex-urlfilter.txt' file (all other regexes are commented out):

+^http://([a-z0-9]*\.)*xyzshoppingsite.com/mobile/*
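To see which URLs a rule like this lets through, it can be approximated outside Nutch with grep. This is only a sketch: it assumes the rule is applied as an unanchored search (a match anywhere in the URL counts), it uses grep -E rather than Java regular expressions, and the helper name passes_filter and the sample URLs are made up for illustration.

```shell
# The rule from regex-urlfilter.txt, without the leading '+' (which marks an accept rule).
RULE='^http://([a-z0-9]*\.)*xyzshoppingsite.com/mobile/*'

# passes_filter is a hypothetical helper: exit status 0 if the URL matches the rule.
passes_filter() {
  printf '%s\n' "$1" | grep -Eq -- "$RULE"
}

# Sample URLs are illustrative only.
for url in \
  'http://www.xyzshoppingsite.com/mobiles/' \
  'http://www.xyzshoppingsite.com/mobile/phone-1' \
  'http://www.xyzshoppingsite.com/laptops/'
do
  if passes_filter "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

Under these assumptions the first two URLs are accepted and the third is rejected. Note that "mobile/*" means "mobile" followed by zero or more slashes, which is why /mobiles/ also matches.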

At the end of the crawl, I can see that a table named "webpage" has been created in HBase, but I am unable to verify whether all of the data has been crawled completely. A search in Solr returns 0 results.

My ultimate goal is to get the complete data from all pages under mobile in the URL above.

Could you please let me know:

1. How can I verify the crawled data present in HBase?
2. The Solr log directory contains 0 files, so I cannot find any clue there. How can I resolve this?
3. The output of the HBase command scan "webpage" shows only timestamp data, and other data such as:

'value=\x0A\x0APlease Wait ... Redirecting to <a href="/mobiles"><b>http://www.xyzshoppingsite.com/mobiles</b></a>Please Wait ... Redirecting to <a href="/mobiles"><b>http://www.xyzshoppingsite.com/mobiles</b></a>'

Why is the data crawled like this, rather than the actual contents of the page after redirection?
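One way to check whether the seed URL itself answers with a redirect is to look at the response headers outside Nutch. A minimal sketch, assuming curl is installed; show_redirects is a made-up helper name that only filters header text:

```shell
# show_redirects: hypothetical helper. Reads raw HTTP response headers on stdin
# and prints only the status lines and any Location headers.
show_redirects() {
  tr -d '\r' | grep -iE '^(HTTP/|Location:)'
}

# Usage (needs network access, so it is left commented out here):
#   curl -sIL 'http://www.xyzshoppingsite.com/mobiles/' | show_redirects
```

If a 3xx status and a Location header show up, the content stored in HBase is probably the redirect response itself rather than the target page. Nutch's http.redirect.max property (0 by default in nutch-default.xml, as far as is known for 2.x) controls whether the fetcher follows redirects immediately or records them for a later round.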

Please help. Thanks in advance.

Thanks and Regards!

Tags: solr hbase nutch web-crawler

1 Answer

This account has been banned

#2 · 2019-08-05 02:14

Instead of executing all those steps, can you try the command below?

./bin/crawl url/seed.txt shoppingcrawl http://localhost:8080/solr 2

If it runs successfully, a table named shoppingcrawl_webpage will be created in HBase.
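The table name follows from the crawl ID passed to the script: the manual steps in the question used no crawl ID and produced "webpage", while the crawl ID shoppingcrawl gives shoppingcrawl_webpage. A small sketch of that naming rule, where webpage_table is a hypothetical helper and not part of Nutch:

```shell
# webpage_table: hypothetical helper that prints the HBase table name
# expected for a given crawl ID ("webpage" when no crawl ID is given).
webpage_table() {
  if [ -n "${1:-}" ]; then
    printf '%s_webpage\n' "$1"
  else
    printf 'webpage\n'
  fi
}

webpage_table shoppingcrawl   # prints shoppingcrawl_webpage
webpage_table                 # prints webpage
```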

You can check this by running the command below in the HBase shell:

hbase> list

Then scan the specific table, in this case:

hbase> scan 'shoppingcrawl_webpage'
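A full scan prints every cell, including raw page content, which is hard to read. A narrower check can be sketched as below. It assumes the hbase launcher is on the PATH and that the table uses the column names from Nutch's default gora-hbase-mapping.xml (f:st for fetch status, f:bas for base URL, p:t for parsed title); verify these against your own mapping file. The function name check_commands is made up for illustration.

```shell
# check_commands: hypothetical helper that prints HBase shell statements
# for a row count and a limited scan of selected columns.
# $1 = table name, $2 = maximum number of rows to scan (default 10).
check_commands() {
  table="$1"
  limit="${2:-10}"
  printf "count '%s'\n" "$table"
  # Column names are an assumption based on the default Gora mapping.
  printf "scan '%s', {COLUMNS => ['f:st', 'f:bas', 'p:t'], LIMIT => %s}\n" "$table" "$limit"
}

# Print the statements so they can be reviewed first.
check_commands shoppingcrawl_webpage 5

# To run them for real (requires a running HBase):
#   check_commands shoppingcrawl_webpage 5 | hbase shell
```

Row keys in Nutch 2.x are stored as reversed URLs (for example com.xyzshoppingsite.www:http/mobiles/), so the keys in the output will not look like the original URLs.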
