Question:

I have crawled websites using Nutch and pushed the crawled data to Solr. Now I want to search the content between specific tags that have specific attribute values. For example:

 <h><title> title to search </title></h>
 <div id="abc">
     content to search
 </div>
 <div class="efg">
     other content to search
 </div>

I have seen this question (how to parse html with nutch and index specific tag to solr?), but it is not clear enough.

I want to know whether a plugin is already available for this or whether I need to write a custom plugin from scratch. If I have to write a plugin, I just need a few pointers on handling HTML tags and attributes.

Answer 1:

You could use the HTMLStripCharFilterFactory in your analyzer before tokenizing.

This char filter strips HTML markup from the input stream. For more information, see the Solr documentation for HTMLStripCharFilterFactory.
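
As a rough illustration, a field type using this char filter might look like the fragment below. This is only a sketch: the field type name `text_html` is made up for the example, and the tokenizer and lowercase filter are just common choices, not something the setup requires. Note that the char filter removes markup wholesale; as far as I know it does not by itself select content by tag or attribute.

```xml
<!-- Hypothetical field type; the name "text_html" is an assumption for this example -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Strip HTML markup before the text reaches the tokenizer -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```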

Answer 2:

You can implement a Nutch parse filter that extracts only the parts of the page you need to index, using DOM manipulation (I like the Jericho HTML Parser for this). Its TextExtractor class gives you clean text (without HTML tags) to use in your index. I usually store that data in custom fields.
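
To make the extraction step concrete, here is a minimal sketch of pulling the text out of a `<div>` with a given attribute value. To stay dependency-free it uses the HTML parser bundled with the JDK (`javax.swing.text.html`) instead of Jericho, so it is not the Jericho API described above. The class name `AttributeTextExtractor` and the method `extract` are hypothetical names for this example. In a real Nutch plugin, the same logic would presumably live inside an `HtmlParseFilter` implementation and write its result to a custom field.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of Nutch, Solr, or Jericho.
class AttributeTextExtractor {

    /** Collects the text inside every <div> whose attribute `attr` equals `value`. */
    static String extract(String html, HTML.Attribute attr, String value) {
        StringBuilder out = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Greater than 0 while we are inside a matching <div> (counts nested divs).
            private int depth = 0;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.DIV) {
                    return;
                }
                if (depth > 0) {
                    depth++; // nested <div> inside a match
                } else if (value.equals(attrs.getAttribute(attr))) {
                    depth = 1; // entering a matching <div>
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.DIV && depth > 0) {
                    depth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) {
                    if (out.length() > 0) {
                        out.append(' ');
                    }
                    out.append(new String(data).trim());
                }
            }
        };

        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>title to search</title></head><body>"
                + "<div id=\"abc\">content to search</div>"
                + "<div class=\"efg\">other content to search</div>"
                + "</body></html>";
        System.out.println(extract(html, HTML.Attribute.ID, "abc"));
        System.out.println(extract(html, HTML.Attribute.CLASS, "efg"));
    }
}
```

The depth counter is there so that a matching `<div>` containing further `<div>` elements is captured as a whole. With Jericho, the same idea would be expressed through its own element lookup and TextExtractor rather than a callback.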