I can't find any hint on how to set up Nutch so that it does NOT filter/remove my URL parameters. I want to crawl and index some pages where lots of content is hidden behind the same base URL (like /news.jsp?id=1, /news.jsp?id=2, /news.jsp?id=3 and so on).
- the regex-normalize.xml only removes redundant stuff from the URL (like session IDs and a trailing ?)
- the regex-urlfilter.txt seems to have a wildcard for my host (+^http://$myHost/)
The crawling works fine so far. Any ideas?
cheers, mana
A part of the solution is hidden here:
configuring nutch regex-normalize.xml
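In the URL filter file, the rule that follows the comment
# skip URLs containing certain characters as probable queries, etc.
has to be modified. It must no longer reject the characters that can occur in a URL parameter, such as '?' and '='; the new line simply leaves those characters out.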
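Assuming a stock Nutch 1.x filter file (regex-urlfilter.txt, or crawl-urlfilter.txt when the one-step crawl command of older releases is used), the change would look roughly like this. The replacement rule is only one way to satisfy the description above; any variant that drops '?' and '=' from the character class should behave the same.

```
# skip URLs containing certain characters as probable queries, etc.
# default rule (rejects any URL containing ?, *, !, @ or =):
# -[?*!@=]
# modified rule (assumption: still rejects *, ! and @, but lets ? and = through):
-[*!@]
```

Rules are evaluated top to bottom and the first match decides, so this line has to stay above the `+` rule that accepts your host.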
Pages are now crawled with their parameters, but they are not yet sent to Solr with them (Solr still cuts the parameters off the links).
Nutch has some issues with how it handles relative URLs ('?param=value'). I'm still stuck on that parameter thing;
see the mailing list: http://search.lucidimagination.com/search/document/b6011a942b323ba3/problem_with_href_param_value_links
You could create a custom field in a Nutch indexing filter to save the entire URL. As long as you define the same field in the Solr schema with stored="true", it will show up in your results. See WritingPluginExample-1.2 on the Nutch wiki.
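As a sketch of the Solr side, the field definition in schema.xml might look like the following. The name `fullUrl` is hypothetical and has to match whatever field name your Nutch indexing filter writes; `string` is assumed to be a field type already defined in your schema.

```xml
<!-- hypothetical field that keeps the complete URL, query string included -->
<field name="fullUrl" type="string" indexed="true" stored="true"/>
```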
Let me know if you'd like some help.