I'm going to use Apache Nutch v1.3 to extract only some specific content from web pages. I checked the parse-html plugin; it seems to normalize each HTML page using TagSoup or NekoHTML, which is good. I need to extract only the text inside <span class='xxx'> and <span class='yyy'> elements on the page. It would be great if the extracted texts were saved into different fields (e.g. content_xxx and content_yyy).
My question is: should I write my own plugin, or can this be done in some standard way?
The best approach would be to apply an XSLT stylesheet to the normalized page and collect the result. Is that possible?
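For illustration, the kind of stylesheet I have in mind would look roughly like this (just a sketch, not tested against Nutch's normalized output):

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- Pull the text of the two span classes into separate fields -->
  <xsl:template match="/">
    <xsl:text>content_xxx: </xsl:text>
    <xsl:apply-templates select="//span[@class='xxx']"/>
    <xsl:text>&#10;content_yyy: </xsl:text>
    <xsl:apply-templates select="//span[@class='yyy']"/>
  </xsl:template>
</xsl:stylesheet>
```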
Building your own ParsingFilter and IndexingFilter is easy. Nutch provides you with the DOM document, which you only need to traverse, searching for your spans. Then you simply add the new fields to your index and schema, and you're done.
There are some examples on how to do this:
http://wiki.apache.org/nutch/HowToMakeCustomSearch
http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html
Good luck
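To give an idea of the traversal such a filter would do, here is a minimal, self-contained sketch using plain JAXP DOM (outside Nutch — the class and method names here are mine, not Nutch API; inside a real parse filter you would run the same walk over the DocumentFragment Nutch hands you):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

public class SpanExtractor {

    // Recursively collect the text of every <span> whose class attribute
    // matches cssClass -- the same walk you would do over the DOM that
    // Nutch's HTML parser gives your filter.
    static void collect(Node node, String cssClass, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if ("span".equalsIgnoreCase(el.getTagName())
                    && cssClass.equals(el.getAttribute("class"))) {
                out.add(el.getTextContent().trim());
            }
        }
        for (Node child = node.getFirstChild(); child != null;
                 child = child.getNextSibling()) {
            collect(child, cssClass, out);
        }
    }

    // Convenience wrapper: parse a (well-formed) HTML snippet and extract
    // the matching span texts. A real filter would receive an already
    // normalized DOM instead of parsing raw markup itself.
    static List<String> extract(String html, String cssClass) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(html.getBytes("UTF-8")));
            List<String> out = new ArrayList<>();
            collect(doc.getDocumentElement(), cssClass, out);
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<span class='xxx'>first</span>"
                + "<p><span class='yyy'>second</span></p>"
                + "</body></html>";
        System.out.println("content_xxx=" + extract(html, "xxx"));
        System.out.println("content_yyy=" + extract(html, "yyy"));
    }
}
```

Each extracted list would then become one field (content_xxx, content_yyy) added in your indexing filter.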
By default, the content is flat (plain text) after parsing.
So I don't think you can do what you want at that stage, unless you can extract your content before the indexing step, i.e. before the content has been flattened.
Instead of writing your own plugins, you can also use these existing plugins, which can be configured to extract parts of pages:
- https://github.com/BayanGroup/nutch-custom-search
- http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/