I can't find any hint on how to set up Nutch so that it does NOT filter/remove my URL parameters. I want to crawl and index some pages where lots of content is hidden behind the same base URL (like /news.jsp?id=1, /news.jsp?id=2, /news.jsp?id=3, and so on).
- the regex-normalize.xml only removes redundant stuff from the URL (like session IDs and a trailing ?)
- the regex-urlfilter.txt already has an accept rule for my host (+^http://$myHost/, see the excerpt below)
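For reference, the accept rule in my regex-urlfilter.txt looks roughly like the excerpt below ($myHost stands for the real host name). As far as I understand, the file is evaluated top to bottom and the first matching rule decides whether a URL is kept or dropped:

# regex-urlfilter.txt (excerpt)
# ... several - rules come first (file:/ftp:/mailto: URLs, image suffixes, ...) ...

# accept everything on my host
+^http://$myHost/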
The crawling works fine so far. Any ideas?
cheers, mana
EDIT:
Part of the solution is hidden here:
configuring nutch regex-normalize.xml
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
has to be modified. One has to allow all characters that may appear in a URL parameter, like '?' and '='. The new line looks like
-[*!@]
And pages are now crawled with their parameters. But they are not yet sent to Solr with parameters (Solr still cuts the parameters from the links).
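To be explicit, the relevant section of the filter file now looks roughly like this (I'm assuming conf/regex-urlfilter.txt here, which is where this rule sits in a default Nutch install; crawl-urlfilter.txt contains the same line):

# skip URLs containing certain characters as probable queries, etc.
# '?' and '=' removed from the character class so query URLs get through
-[*!@]

# accept everything on my host, including the /news.jsp?id=... URLs
+^http://$myHost/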
EDIT2:
Nutch has some issues with handling relative URLs ('?param=value'). Still stuck on that parameter problem:
see mailing list: http://search.lucidimagination.com/search/document/b6011a942b323ba3/problem_with_href_param_value_links
You could create a custom field in a Nutch indexing filter to save the entire URL. As long as you define the same field in the Solr schema with stored="true", it will show up in your results. See WritingPluginExample-1.2.
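Here is a rough sketch of what such an indexing filter could look like, modeled on the WritingPluginExample. The package, class, and field names ("fullurl") are made up, and the exact set of IndexingFilter methods differs slightly between Nutch versions, so treat this as a starting point rather than a drop-in plugin:

package org.example.nutch;  // hypothetical package name

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

// Copies the complete URL (query string included) into a separate index field.
public class FullUrlIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // "fullurl" is a made-up field name; it must match the field in schema.xml
    doc.add("fullurl", url.toString());
    return doc;
  }

  // Only used by the Lucene index backend in older Nutch versions; a no-op here.
  public void addIndexBackendOptions(Configuration conf) {
  }

  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }
}

The plugin still has to be registered the usual way: a plugin.xml that hooks the class into the org.apache.nutch.indexer.IndexingFilter extension point, the plugin id added to plugin.includes in nutch-site.xml, and a matching field in Solr's schema.xml (e.g. a string field named fullurl with indexed="true" and stored="true").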
Let me know if you'd like some help.