Unable to verify crawled data stored in hbase

Posted 2019-08-05 02:05

I have crawled a website using Nutch with HBase as the storage back-end, following this tutorial: http://wiki.apache.org/nutch/Nutch2Tutorial.

The Nutch version is 2.2.1, the HBase version is 0.90.4, and the Solr version is 4.7.1.

Here are the steps I used:

./runtime/local/bin/nutch inject urls

./runtime/local/bin/nutch generate -topN 100 -adddays 30

./runtime/local/bin/nutch fetch -all

./runtime/local/bin/nutch fetch -all

./runtime/local/bin/nutch updatedb

./runtime/local/bin/nutch solrindex http://localhost:8983/solr/ -all
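
For comparison, the sequence above can be sketched as a small dry-run script. As I understand the Nutch 2.x tutorial, it lists a parse step between fetch and updatedb, whereas the steps above show fetch twice. The `parse -all` line below is therefore an assumption about what was intended, not something stated in the post. The `run` helper, the `crawl_round` function, and the `DRY_RUN`, `NUTCH`, and `SOLR_URL` variables are hypothetical names used only for this sketch.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"
NUTCH="${NUTCH:-./runtime/local/bin/nutch}"
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/}"

# Print a command, and execute it only when DRY_RUN is 0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@" || exit 1
  fi
}

# One crawl round, in the order given in the post.
crawl_round() {
  run "$NUTCH" inject urls
  run "$NUTCH" generate -topN 100 -adddays 30
  run "$NUTCH" fetch -all
  run "$NUTCH" parse -all   # assumption: the post lists "fetch -all" a second time here
  run "$NUTCH" updatedb
  run "$NUTCH" solrindex "$SOLR_URL" -all
}

crawl_round
```

If pages are fetched but never parsed, there may be nothing for the indexing step to send to Solr, which would be consistent with an empty result set. That is a possibility worth checking, not a confirmed diagnosis.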

My url/seed.txt file contains: http://www.xyzshoppingsite.com/mobiles/

I have kept only the line below in the 'regex-urlfilter.txt' file (all other regexes are commented out):

+^http://([a-z0-9]*\.)*xyzshoppingsite.com/mobile/*
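
A quick way to see which URLs this pattern would accept is to test it outside Nutch. The sketch below uses `grep -E` as a stand-in for the Java regex engine that Nutch uses, which should behave the same for this particular pattern. It also assumes the filter accepts a URL when the pattern matches any part of it (partial match). The `check_url` function and the sample URLs other than the seed are hypothetical.

```shell
#!/bin/sh
# Pattern copied from the regex-urlfilter.txt line above, without the leading "+".
PATTERN='^http://([a-z0-9]*\.)*xyzshoppingsite.com/mobile/*'

# Print "accept" if the pattern matches somewhere in the URL, otherwise "reject".
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$PATTERN"; then
    echo "accept"
  else
    echo "reject"
  fi
}

check_url 'http://www.xyzshoppingsite.com/mobiles/'   # the seed URL
check_url 'http://www.xyzshoppingsite.com/mobiles'    # the redirect target
check_url 'https://www.xyzshoppingsite.com/mobiles'   # same path over https
check_url 'http://www.xyzshoppingsite.com/laptops/'   # a different section
```

Under these assumptions the first two URLs are accepted and the last two are rejected, so the filter itself does not appear to block the seed or its redirect target. Note that `/mobile/*` means "mobile" followed by zero or more slashes, not "everything under /mobile/".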

At the end of the crawl, I can see that a table named "webpage" has been created in HBase, but I am unable to verify whether all of the data has been crawled completely. Searching in Solr returns nothing (0 results).

My ultimate intention is to get the complete data from all pages under mobile at the above URL.

Could you please let me know:

  • How can I verify the crawled data present in HBase?

  • The Solr log directory contains no files, so I have nothing to go on. How can I resolve this?

  • The output of the HBase command scan "webpage" shows only timestamp data, and the remaining data looks like this:

    'value=\x0A\x0APlease Wait ... Redirecting to <a href="/mobiles"><b>http://www.xyzshoppingsite.com/mobiles</b></a>Please Wait ... Redirecting to <a href="/mobiles"><b>http://www.xyzshoppingsite.com/mobiles</b></a>'

Why was the redirect notice crawled instead of the actual contents of the page after redirection?

Please help. Thanks in advance.

Thanks and Regards!

1 answer
This account has been banned
#2 · 2019-08-05 02:14

Instead of executing all those steps, can you use the command below?

./bin/crawl url/seed.txt shoppingcrawl http://localhost:8080/solr 2

If it executes successfully, a table named shoppingcrawl_webpage will be created in HBase.

We can check by executing the command below in the HBase shell:

hbase> list

Then we can scan the specific table. In this case:

 hbase> scan 'shoppingcrawl_webpage'
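
The same checks can be scripted rather than typed interactively. The sketch below is a minimal example that assumes the `hbase` launcher is on the PATH and that the table is named as in this answer (use `webpage` if the crawl was run without a crawl ID). The `hbase_checks` function and the `TABLE` variable are hypothetical names. A row count plus a limited scan gives a rough idea of how many pages were stored without dumping the whole table.

```shell
#!/bin/sh
# Table name from the answer above; override with TABLE=webpage if no crawl ID was used.
TABLE="${TABLE:-shoppingcrawl_webpage}"

# Print the HBase shell commands for a quick sanity check of one table.
hbase_checks() {
  cat <<EOF
list
count '$1'
scan '$1', {LIMIT => 5}
EOF
}

# Pipe the commands into the HBase shell if it is installed; otherwise just show them.
if command -v hbase >/dev/null 2>&1; then
  hbase_checks "$TABLE" | hbase shell
else
  hbase_checks "$TABLE"
fi
```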