How to parse content located in specific HTML tags

Published 2019-04-07 23:58

Question:

I am using Nutch to crawl websites, and I want to parse specific sections of the HTML pages it fetches. For example:

  <h><title> title to search </title></h>
   <div id="abc">
        content to search
   </div>
   <div class="efg">
        other content to search
   </div>

I want to extract the content of the div element with id="abc", the div element with class="efg", and so on.

I know that I have to create a plugin for customized parsing, because the HTML parser plugin provided by Nutch removes all HTML tags, CSS, and JavaScript and leaves only the text content. I referred to this blog post, http://sujitpal.blogspot.in/2009/07/nutch-custom-plugin-to-parse-and-add.html, but it covers parsing by HTML tag name, whereas I want to select HTML tags whose attributes have specific values. I have seen Jericho mentioned as useful for parsing specific HTML tags, but I could not find any example of a Nutch plugin that uses Jericho.

I need some guidance on how to devise a strategy for parsing HTML pages based on tags whose attributes have specific values.

Answer 1:

You can use this plugin to extract data from your pages based on CSS selectors:

https://github.com/BayanGroup/nutch-custom-search

For your example, you can configure it like this:

<config>
    <fields>
        <field name="custom_content" />
    </fields>
    <documents>
        <document url=".+" engine="css">
            <extract-to field="custom_content">
                <text>
                    <expr value="#abc" />
                </text>
                <text>
                    <expr value=".efg" />
                </text>
            </extract-to>
        </document>
    </documents>
</config>
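
If you would rather write the selection logic yourself inside a custom parse filter, the same idea can be sketched with plain DOM traversal. Nutch's HtmlParseFilter extension point hands the filter a parsed DOM (an org.w3c.dom.DocumentFragment), so a recursive walk over org.w3c.dom nodes is one possible approach. The sketch below is only an illustration using JDK classes: the class name AttributeTextExtractor and the helpers extract and parse are hypothetical and not part of Nutch or the plugin above, the sample page is a well-formed variant of the HTML in the question (with <head> instead of <h>), and it assumes attribute names arrive in lowercase, which may depend on the HTML parser Nutch is configured to use.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects text from elements
// selected by id or by class name.
public class AttributeTextExtractor {

    // Returns the trimmed text of every element whose id equals `id` or whose
    // class list contains `cssClass`, in document order.
    static List<String> extract(Node root, String id, String cssClass) {
        List<String> out = new ArrayList<>();
        collect(root, id, cssClass, out);
        return out;
    }

    private static void collect(Node node, String id, String cssClass, List<String> out) {
        if (node instanceof Element) {
            Element el = (Element) node;
            // getAttribute returns "" when the attribute is absent.
            boolean idMatch = id.equals(el.getAttribute("id"));
            boolean classMatch = Arrays
                    .asList(el.getAttribute("class").trim().split("\\s+"))
                    .contains(cssClass);
            if (idMatch || classMatch) {
                out.add(el.getTextContent().trim());
                return; // do not descend, so nested matches are not duplicated
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, id, cssClass, out);
        }
    }

    // Stand-in for the DOM that Nutch would supply: parses a well-formed
    // XHTML string with the JDK's XML parser.
    static Node parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><title> title to search </title></head><body>"
                + "<div id=\"abc\"> content to search </div>"
                + "<div class=\"efg\"> other content to search </div>"
                + "</body></html>";
        // prints [content to search, other content to search]
        System.out.println(extract(parse(page), "abc", "efg"));
    }
}
```

In a real plugin, the parse step would be dropped and the walk would start from the DocumentFragment that Nutch passes to the filter; the extracted strings could then be stored in the parse metadata and indexed as a custom field, along the lines of the blog post linked in the question.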