How to parse content located in specific HTML tags

Posted 2019-04-08 00:19

I am using Nutch to crawl websites, and I want to parse specific sections of the HTML pages it crawls. For example:

  <h><title> title to search </title></h>
   <div id="abc">
        content to search
   </div>
   <div class="efg">
        other content to search
   </div>

I want to extract the content of the div element with id="abc", the div element with class="efg", and so on.

I know that I have to create a plugin for customized parsing, because the HTML parser plugin provided by Nutch removes all HTML tags, CSS and JavaScript and leaves only the text content. I referred to this blog post, http://sujitpal.blogspot.in/2009/07/nutch-custom-plugin-to-parse-and-add.html, but it covers parsing by HTML tag name, whereas I want to select HTML tags by an attribute with a specific value. I found Jericho mentioned as useful for parsing specific HTML tags, but I could not find any example of a Nutch plugin that uses Jericho.

I need some guidance on a strategy for parsing HTML pages by selecting tags whose attributes have specific values.

Tags: nutch

1 answer

Ridiculous、

#2 · 2019-04-08 00:25

You can use this plugin to extract data from your pages based on CSS selectors:

https://github.com/BayanGroup/nutch-custom-search

In your example, you can configure it in this way:

<config>
    <fields>
        <field name="custom_content" />
    </fields>
    <documents>
        <document url=".+" engine="css">
            <extract-to field="custom_content">
                <text>
                    <expr value="#abc" />
                </text>
                <text>
                    <expr value=".efg" />
                </text>
            </extract-to>
        </document>
    </documents>
</config>
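
If you would rather write your own parse filter than depend on the plugin, the same selection can be sketched with nothing but the JDK's DOM classes. This is only an illustration under assumptions: as far as I know, Nutch 1.x passes a DOM `DocumentFragment` to an `HtmlParseFilter`, so a tree walk like the one below could be called from there, but the class name `AttributeTextExtractor`, its methods, and the well-formed sample page are all made up for this example and are not part of Nutch or of the plugin above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Nutch class: collects the text of elements
// selected by an id value or by a class name.
public class AttributeTextExtractor {

    /** Returns the trimmed text of every element whose id equals {@code id}
     *  or whose class attribute contains {@code cssClass} as one of its names. */
    public static List<String> extract(Node root, String id, String cssClass) {
        List<String> out = new ArrayList<>();
        walk(root, id, cssClass, out);
        return out;
    }

    private static void walk(Node node, String id, String cssClass, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // getAttribute returns "" when the attribute is absent.
            boolean idMatch = id.equals(el.getAttribute("id"));
            boolean classMatch = Arrays
                    .asList(el.getAttribute("class").trim().split("\\s+"))
                    .contains(cssClass);
            if (idMatch || classMatch) {
                out.add(el.getTextContent().trim());
                return; // the matched element's text already includes its descendants
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, id, cssClass, out);
        }
    }

    /** Parses a well-formed XHTML string; a stand-in for the DOM that a
     *  crawler's HTML parser would normally provide. */
    public static Node parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><title> title to search </title></head><body>"
                + "<div id=\"abc\"> content to search </div>"
                + "<div class=\"efg\"> other content to search </div>"
                + "<div class=\"xyz\"> ignored </div>"
                + "</body></html>";
        // prints [content to search, other content to search]
        System.out.println(extract(parse(page), "abc", "efg"));
    }
}
```

Matching on attributes alone, rather than on the tag name, keeps the walk independent of how the HTML parser cases element names. Inside a real plugin you would still need to store the collected text somewhere, for example in the parse metadata, so that an indexing filter can pick it up.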
