I am using Nutch to crawl websites and I want to parse specific sections of html pages crawled by Nutch. For example,
<h><title> title to search </title></h>
<div id="abc">
content to search
</div>
<div class="efg">
other content to search
</div>
I want to extract the div element with id="abc", the div element with class="efg", and so on.
I know that I have to create a plugin for customized parsing, because the HTML parser plugin provided by Nutch removes all HTML tags, CSS and JavaScript and leaves only the text content. I referred to this blog post, http://sujitpal.blogspot.in/2009/07/nutch-custom-plugin-to-parse-and-add.html, but it only covers parsing by HTML tag, whereas I want to select tags whose attributes have specific values. I found Jericho mentioned as useful for parsing specific HTML tags, but I could not find any example of a Nutch plugin that uses Jericho.
I need some guidance on how to devise a strategy for parsing HTML pages on the basis of tags with attributes that have specific values.
You can use this plugin to extract data from your pages based on CSS rules:
https://github.com/BayanGroup/nutch-custom-search
In your example, you can configure it in this way:
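
Separately from the plugin (this is not its configuration format), if you end up writing your own parse filter, the same attribute-based selection can be done on a DOM with XPath. Nutch passes an HtmlParseFilter a DOM DocumentFragment, so the idea carries over. Below is a minimal sketch using only the JDK, assuming the markup is well-formed XML so that the standard parser accepts it; the class name `AttributeSelectorSketch` and the helpers `parse`, `select` and `byClass` are hypothetical names made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Nutch or of the plugin above.
final class AttributeSelectorSketch {

    // Parses a well-formed XML/XHTML string into a DOM document.
    // Assumption: real-world HTML would be cleaned up by Nutch's parser first.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the trimmed text content of every node matched by the XPath expression.
    static List<String> select(Document doc, String xpath) throws Exception {
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent().trim());
        }
        return out;
    }

    // Builds an XPath that matches div elements carrying the given class token,
    // even when the class attribute lists several space-separated classes.
    static String byClass(String cssClass) {
        return "//div[contains(concat(' ', normalize-space(@class), ' '), ' "
                + cssClass + " ')]";
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><title> title to search </title></head><body>"
                + "<div id=\"abc\">content to search</div>"
                + "<div class=\"efg\">other content to search</div>"
                + "</body></html>";
        Document doc = parse(page);
        System.out.println(select(doc, "//div[@id='abc']")); // [content to search]
        System.out.println(select(doc, byClass("efg")));     // [other content to search]
    }
}
```

In a real plugin you would run the same XPath expressions against the DocumentFragment that Nutch hands to your filter and store the results in the parse metadata, so that an indexing filter can pick them up.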