How to parse HTML with Nutch and index specific ta

I have installed Nutch and Solr to crawl a website and search it. As you know, we can index the meta tags of web pages into Solr with Nutch's parse meta tags plugin (http://wiki.apache.org/nutch/IndexMetatags). Now I want to know whether there is any way (a plugin or anything else) to index another HTML tag, one that isn't a meta tag, into Solr. For example:

<div id=something>
      me specific tag
</div>

That is, I want to add a field (something) to Solr whose value is "me specific tag" for this page.

Any ideas?

Tags: solr nutch apache-tika

4 Answers

我只想做你的唯一

#2 · 2020-03-24 07:37

Try http://lifelongprogrammer.blogspot.in/2013/08/nutch2-crawl-and-index-extra-tag.html. The tutorial shows how to extract the img tag and describes all the steps involved...

祖国的老花朵

#3 · 2020-03-24 07:48

You may want to look into writing a Nutch plugin, which should allow you to extract an element from a web page.

Rolldiameter

#4 · 2020-03-24 07:50

You can use one of these custom plugins to parse xml files based on xpath (or css selectors):
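
Whichever plugin is chosen, the XPath idea itself can be illustrated with the JDK alone. What follows is a minimal sketch, not the code of any Nutch plugin: the class name XPathDivDemo and the method extract are hypothetical, and it assumes the markup is well-formed (attribute values quoted), which real crawled HTML often is not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; it is not part of Nutch or of any plugin.
class XPathDivDemo {

    // Evaluates an XPath expression against well-formed markup and
    // returns the trimmed string value of the first matching node
    // (an empty string when nothing matches).
    static String extract(String xhtml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc).trim();
        } catch (Exception e) {
            // Malformed markup or a bad expression ends up here.
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"something\">\n"
                + "      me specific tag\n"
                + "</div></body></html>";
        System.out.println(extract(page, "//div[@id='something']"));
    }
}
```

The expression //div[@id='something'] selects the div from the question; a plugin built on this idea would typically read such expressions from its configuration rather than hard-coding them.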

不美不萌又怎样

#5 · 2020-03-24 07:57

I made my own plugin for something similar to what you want. The config file that maps a NutchDocument to a SolrDocument is $NUTCH_HOME/conf/solrindex-mapping.xml, and you can add your own fields there. However, you still have to fill those fields yourself somewhere.

Here are some tips for writing the plugin:

- Read http://wiki.apache.org/nutch/WritingPluginExample, which shows how to build a simple plugin.
- In your plugin, extend the ParseFilter and IndexingFilter.
- In YourParseFilter, you can use NodeWalker to find your specific div.
- Put the parsed information into the page metadata like this:

    page.putToMetadata(new Utf8("yourKEY"), ByteBuffer.wrap(YourByteArrayParsedFromMetaData));

- In YourIndexingFilter, add the metadata from the page (page.getMetadata) to the NutchDocument:

    doc.add("your_specific_tag", value);

- Most important: add your_specific_tag to the fields of:
  - the Solr config file schema.xml (and restart Solr):
    <field name="your_specific_tag" type="string" stored="true" indexed="true"/>
  - the Nutch config file schema.xml (I don't know whether this is really necessary)
  - the Nutch config file solrindex-mapping.xml:
    <field dest="your_specific_tag" source="your_specific_tag"/>
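
The parse and index steps above can be tried out without Nutch on the classpath. The sketch below is only an illustration under stated assumptions: the class DivMetadataSketch and all of its methods are hypothetical, a hand-written depth-first walk stands in for NodeWalker, and the input is assumed to be well-formed markup. It finds the div, wraps its text in a ByteBuffer (the value type used with page.putToMetadata), and decodes it again the way an indexing filter would before calling doc.add.

```java
import java.io.StringReader;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical stand-in for the parse/index filter pair; JDK classes only.
class DivMetadataSketch {

    // Parses well-formed markup into a DOM tree (assumption: quoted attributes).
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup is not well-formed", e);
        }
    }

    // Depth-first search for the first <div> with the given id.
    // Returns its trimmed text, or null when no such div exists.
    static String findDivText(Node node, String id) {
        if (node instanceof Element) {
            Element el = (Element) node;
            if ("div".equalsIgnoreCase(el.getTagName()) && id.equals(el.getAttribute("id"))) {
                return el.getTextContent().trim();
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            String found = findDivText(child, id);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // Parse-filter side: encode the text as UTF-8 bytes for the metadata map.
    static ByteBuffer toMetadataValue(String text) {
        return ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
    }

    // Indexing-filter side: decode a copy so the stored buffer's position is untouched.
    static String fromMetadataValue(ByteBuffer value) {
        return StandardCharsets.UTF_8.decode(value.duplicate()).toString();
    }

    public static void main(String[] args) {
        Node root = parse("<html><body><div id=\"something\">\n"
                + "      me specific tag\n"
                + "</div></body></html>");
        ByteBuffer stored = toMetadataValue(findDivText(root, "something"));
        System.out.println(fromMetadataValue(stored));
    }
}
```

Decoding from value.duplicate() rather than from the buffer itself is a deliberate choice here: reading a ByteBuffer advances its position, so a second reader of the same metadata entry would otherwise see an empty value.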
