Question:

I am going to use Apache Nutch v1.3 to extract only specific content from web pages. I checked the parse-html plugin; it seems to normalize each HTML page using TagSoup or NekoHTML, which is good. I need to extract only the text inside <span class='xxx'> and <span class='yyy'> elements on the page. It would be great if the extracted text were saved into separate fields (e.g. content_xxx, content_yyy). My question is: should I write my own plugin, or can this be done in some standard way?

The best way would be to apply XSLT to the normalized web page and take the result. Is that possible?
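
To make the XSLT idea concrete, here is a minimal sketch that uses only the JDK's javax.xml.transform. It assumes the page has already been normalized into well-formed XML with no namespace on its elements, and it says nothing about how to hook the transform into Nutch. The class XsltExtract and its methods stylesheetFor and transform are hypothetical names made up for this illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of Nutch: applies an XSLT stylesheet to
// already-normalized (well-formed, namespace-free) XHTML.
class XsltExtract {

    // Builds a stylesheet that prints the text of every <span class='cls'>,
    // one per line. Assumes cls contains no quote characters.
    static String stylesheetFor(String cls) {
        return "<xsl:stylesheet version='1.0'"
             + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
             + "<xsl:output method='text'/>"
             + "<xsl:template match='/'>"
             + "<xsl:for-each select=\"//span[@class='" + cls + "']\">"
             + "<xsl:value-of select='.'/><xsl:text>&#10;</xsl:text>"
             + "</xsl:for-each>"
             + "</xsl:template>"
             + "</xsl:stylesheet>";
    }

    // Runs the stylesheet over the page and returns the text output.
    static String transform(String xhtml, String stylesheet) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("XSLT transform failed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><span class='xxx'>first</span>"
                    + "<span class='yyy'>other</span>"
                    + "<p><span class='xxx'>second</span></p></body></html>";
        // Prints the text of the two 'xxx' spans, one per line.
        System.out.print(transform(page, stylesheetFor("xxx")));
    }
}
```

Running one stylesheet per class would give one string per target field (content_xxx, content_yyy).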

Answer 1:

Building your own ParsingFilter and IndexingFilter is easy. Nutch provides you with the DOM document, which you only need to traverse to find your elements. Then you simply add the new fields to your index and schema, and you're done.

There are some examples on how to do this:

http://wiki.apache.org/nutch/HowToMakeCustomSearch

http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html
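
As a rough illustration of the traversal step only, here is a minimal sketch in plain Java using the standard org.w3c.dom API. It is not Nutch plugin code: inside a real filter you would walk the DOM that Nutch hands you instead of parsing a string, and you would still need to wire the results into the indexing side. The class SpanExtractor and its methods are hypothetical names; the field names content_xxx and content_yyy come from the question.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects the text of selected
// <span> elements into one field per class name.
class SpanExtractor {

    // Recursively walks the tree below `node`. For every <span> whose class
    // attribute is exactly one of the wanted names, appends its text to the
    // field "content_<class>" (multiple matches are joined with a space).
    static void collect(Node node, Set<String> wanted, Map<String, String> fields) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            String cls = el.getAttribute("class"); // "" when the attribute is absent
            // Compare case-insensitively: HTML parsers may upper-case tag names.
            if ("span".equalsIgnoreCase(el.getNodeName()) && wanted.contains(cls)) {
                String text = el.getTextContent().trim();
                fields.merge("content_" + cls, text, (a, b) -> a + " " + b);
                return; // getTextContent already covers nested text
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), wanted, fields);
        }
    }

    // Stand-in for the DOM a parser would provide: parses well-formed XHTML
    // from a string, then runs the traversal.
    static Map<String, String> extract(String xhtml, String... classes) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            Map<String, String> fields = new LinkedHashMap<>();
            collect(root, new HashSet<>(Arrays.asList(classes)), fields);
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><span class='xxx'>Hello</span>"
                    + "<p><span class='yyy'>World</span></p>"
                    + "<span class='zzz'>ignored</span></body></html>";
        System.out.println(extract(page, "xxx", "yyy"));
    }
}
```

The map keys would then correspond to the extra fields you declare in the schema.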

Good luck

Answer 2:

By default, the content is flat after parsing, so I don't think you can do what you want unless you can extract your content at the indexing step, i.e. once the content has been flattened.