Apache Nutch to index only part of page content

Posted 2019-04-02 07:24

Question:

I am going to use Apache Nutch v1.3 to extract only some specific content from web pages. I checked the parse-html plugin; it seems to normalize each HTML page using TagSoup or NekoHTML, which is good. I need to extract only the text inside <span class='xxx'> and <span class='yyy'> elements on the page. It would be great if the extracted text were saved into different fields (e.g. content_xxx, content_yyy). My question is: should I write my own plugin, or can this be done in some standard way?

The best way would be to apply XSLT to the normalized page and take the result. Is that possible?

Answer 1:

Building your own parse filter (the HtmlParseFilter extension point) and IndexingFilter is easy. Nutch provides you with the DOM document, which you only need to traverse to find the elements you want (here, the span elements). Then you simply add the new fields to your index and schema, and you're done.
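The traversal step can be sketched with the JDK's DOM API alone. This is a minimal sketch, not a Nutch plugin: the class name SpanExtractor and its extract method are hypothetical, the field names content_xxx and content_yyy are taken from the question, and the sample page is parsed with the JDK XML parser only so the example runs on its own. In a real parse filter the root node would be the DOM that Nutch passes in, and the plugin wiring (plugin.xml, storing the values in the parse metadata, the matching IndexingFilter) is left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of Nutch, only illustrates the DOM traversal.
public class SpanExtractor {

    /** Collects the text of every span whose class attribute lists one of the given class names. */
    public static Map<String, List<String>> extract(Node root, String... classNames) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (String c : classNames) {
            fields.put("content_" + c, new ArrayList<>());
        }
        walk(root, fields, classNames);
        return fields;
    }

    private static void walk(Node node, Map<String, List<String>> fields, String[] classNames) {
        // Case-insensitive match, since HTML parsers may report tag names in upper case.
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "span".equalsIgnoreCase(node.getNodeName())) {
            // getAttribute returns "" when the attribute is missing.
            String classAttr = ((Element) node).getAttribute("class");
            List<String> classes = Arrays.asList(classAttr.trim().split("\\s+"));
            for (String c : classNames) {
                if (classes.contains(c)) {
                    fields.get("content_" + c).add(node.getTextContent().trim());
                }
            }
        }
        // Recurse into children so nested spans are found as well.
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), fields, classNames);
        }
    }

    public static void main(String[] args) throws Exception {
        // Well-formed sample page; a real crawl would use the DOM produced by parse-html.
        String html = "<html><body><span class='xxx'>first</span>"
                + "<div><span class='yyy'>second</span></div>"
                + "<span class='other'>ignored</span></body></html>";
        Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(html)));
        System.out.println(extract(doc, "xxx", "yyy"));
        // prints {content_xxx=[first], content_yyy=[second]}
    }
}
```

Each map key would then become one field in the index, which is why the schema needs matching entries.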

There are some examples on how to do this:

http://wiki.apache.org/nutch/HowToMakeCustomSearch

http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html

Good luck



Answer 2:

By default the content is flat after parsing, so I don't think you can do what you want unless you can extract your content at the indexing step, i.e. once the content has been flattened.



Answer 3:

Instead of writing your own plugin, you can use one of these existing custom plugins, which can be configured to extract parts of pages:

  • https://github.com/BayanGroup/nutch-custom-search
  • http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/


Tags: solr nutch