I am crawling our large website(s) with Nutch and then indexing with Solr, and the results are pretty good. However, there are several menu structures across the site that get indexed and spoil the results of a query.
Each of these menus is clearly defined in a DIV, e.g. <div id="RHBOX"> ... </div> or <div id="calendar"> ... </div>, and several others.
I need, at some point, to delete the content in these DIVs.
I am guessing that the right place is during indexing by Solr, but I cannot work out how.
A pattern would look something like (<div id="calendar">).*?(</div>),
but I cannot get that to work in <tokenizer class="solr.PatternTokenizerFactory" pattern="(<div id="calendar">).*?(</div>)" />
and I am not really sure where to put it in schema.xml.
When I do put that pattern in, schema.xml no longer parses.
Here is a patch for Solr that you can apply to your indexing config to ignore the contents of tags you configure. It only works with XML, though, so if you can tidy your HTML, or you know it is XHTML, this will work; it won't work with just any random HTML.
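Separately from that patch, a stock-Solr alternative is a char filter in the analyzer chain: solr.PatternReplaceCharFilterFactory can blank out the DIV blocks before tokenizing. Note that the angle brackets must be escaped as &lt;/&gt; inside the attribute, and single-quoting the attribute sidesteps the nested double quotes — which is most likely why the schema.xml in the question failed to parse. A minimal sketch (the field type name is made up; the pattern has the same nested-DIV limitation as any regex approach):

```xml
<fieldType name="text_nomenu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- blank out the menu DIV before tokenizing; angle brackets are
         escaped so schema.xml stays well-formed, and the attribute is
         single-quoted so the double quotes in the pattern are legal -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern='&lt;div id="calendar"&gt;.*?&lt;/div&gt;'
                replacement=""/>
    <!-- then strip the remaining HTML markup -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```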
I think you have a few choices:
1. Extend the Nutch HTML parser and add logic to strip the header out. (There might be better places to do this, such as when you have the raw data but before the DOM is parsed.)
2. Make your site smart enough not to draw the header when Nutch is crawling. This is pretty easy to do by checking the User-Agent value in the request header. You might need to do a better job of seeding your crawl, since the links in the header won't be there to help Nutch find the other pages.
3. Somehow get Solr to remove the header from the Nutch data. I'm not sure how you'd do this, and I think it means you lose some of the Nutch/Solr synergies.
4. Somehow edit the Nutch index (just a Lucene index). In theory, you could walk through all documents in the index and trim the correct property of each Document.
I would think the easiest way is #2, if you have a consistent way of drawing the header (i.e. a skin or a common include). Then perhaps #1 and #4. I think #3 would be the hardest, but I might be wrong.
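The User-Agent check in #2 can be reduced to one predicate your templating layer calls before rendering the menu. A minimal sketch, assuming the crawler's User-Agent string contains "Nutch" (the default Nutch agent name is configurable, so check your http.agent.name; the class and method names here are made up):

```java
// Sketch of option #2: only render the menu DIVs for non-crawler requests.
public class MenuGate {

    /** Returns true when the menu DIVs should be rendered for this request. */
    public static boolean shouldRenderMenu(String userAgent) {
        // Render for browsers and for requests with no User-Agent;
        // skip rendering when the agent identifies itself as Nutch.
        return userAgent == null || !userAgent.toLowerCase().contains("nutch");
    }
}
```

Your page template would then wrap each menu DIV in `if (MenuGate.shouldRenderMenu(request.getHeader("User-Agent")))`.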
A new feature introduced in Nutch 1.12 uses the Apache Tika parser with the Boilerpipe algorithm to strip header and footer content from HTML pages in the parsing stage itself.
We can use the following properties in nutch-site.xml to enable this:
<!-- parse-tika plugin properties -->
<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>
    Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>DefaultExtractor</value>
  <description>
    Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
    or CanolaExtractor.
  </description>
</property>
It's working for me. Hope it will work for others as well.
For a detailed overview, you can refer to this ticket:
https://issues.apache.org/jira/browse/NUTCH-961
If you want to do that, I believe you should write a customized parser in Nutch, so that the data sent for indexing does not contain the menu content.
Basically, after parsing, the text data is raw text without any structure, so by then there is no DIV left to match.
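To make the idea concrete, here is a minimal, standalone sketch of stripping the menu DIVs from the raw HTML before it is parsed to text (a real implementation would live inside a Nutch parse-filter plugin; the class and method names here are made up, and the naive regex breaks on nested DIVs inside the menu block):

```java
import java.util.regex.Pattern;

public class DivStripper {

    // DOTALL so the menu block may span multiple lines; non-greedy .*?
    // so matching stops at the first closing tag.
    private static Pattern divPattern(String id) {
        return Pattern.compile("<div id=\"" + Pattern.quote(id) + "\">.*?</div>",
                Pattern.DOTALL);
    }

    /** Removes each <div id="..."> ... </div> block for the given ids. */
    public static String strip(String html, String... ids) {
        for (String id : ids) {
            html = divPattern(id).matcher(html).replaceAll("");
        }
        return html;
    }
}
```

Running this over the fetched page for ids like "RHBOX" and "calendar" before handing the content to the text extractor would keep the menus out of the index.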