I have installed Nutch and Solr to crawl a website and search it. As you know, the meta tags of web pages can be indexed into Solr with Nutch's parse meta tags plugin (http://wiki.apache.org/nutch/IndexMetatags). Is there any way (a plugin or anything else) to index another HTML tag, one that isn't a meta tag, into Solr? For example:
<div id=something>
me specific tag
</div>
That is, I want to add a field to Solr (something) whose value is "me specific tag" for this page.
Any ideas?
Try http://lifelongprogrammer.blogspot.in/2013/08/nutch2-crawl-and-index-extra-tag.html. The tutorial shows how to extract the img tag and walks through the steps involved.
You may want to check Nutch Plugin which should allow you to extract an element from a web page.
You can use one of these custom plugins to parse XML files based on XPath (or CSS selectors):
I made my own plugin for something similar to what you want. The config file for mapping a NutchDocument to a SolrDocument is $NUTCH_HOME/conf/solrindex-mapping.xml. You can add your own fields there, but you still have to populate them somewhere.
Here are some tips for the plugin:
Put the information you parsed into the page metadata, like this:
page.putToMetadata(new Utf8("yourKEY"), ByteBuffer.wrap(YourByteArrayParsedFromMetaData));
In your IndexingFilter, read the metadata back from the page (page.getMetadata) and add it to the NutchDocument:
doc.add("your_specific_tag", value);
Most important:
Declare your_specific_tag as a field in the Solr schema (schema.xml) and in solrindex-mapping.xml, respectively:
<field name="your_specific_tag" type="string" stored="true" indexed="true"/>
<field dest="your_specific_tag" source="your_specific_tag"/>
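To make the tips above a bit more concrete, here is a rough, standard-library-only sketch of the extraction and metadata round trip. It is not Nutch code: DivTextExtractor and its methods are hypothetical names, and the regex is a simplification that assumes the div has no nested div inside it. In a real plugin the extraction would live in your parse filter (probably working on the parsed DOM rather than a regex), toMetadataValue would feed page.putToMetadata, and fromMetadataValue would run in your IndexingFilter before doc.add.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Nutch or Solr.
class DivTextExtractor {

    // Returns the trimmed inner text of the first <div> with the given id,
    // or null if no such div is found. Assumes no nested <div> inside it.
    static String extractDivText(String html, String id) {
        Pattern p = Pattern.compile(
            "<div\\s+[^>]*\\bid\\s*=\\s*[\"']?" + Pattern.quote(id)
                + "[\"']?(?=[\\s>])[^>]*>(.*?)</div>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    // Parse side: encode the text as UTF-8 bytes, the shape that
    // ByteBuffer.wrap(...) expects in the putToMetadata call above.
    static ByteBuffer toMetadataValue(String text) {
        return ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
    }

    // Index side: decode the stored bytes back into a String.
    // duplicate() leaves the original buffer's position untouched.
    static String fromMetadataValue(ByteBuffer value) {
        return StandardCharsets.UTF_8.decode(value.duplicate()).toString();
    }

    public static void main(String[] args) {
        String html = "<div id=something>\nme specific tag\n</div>";
        String text = extractDivText(html, "something");
        ByteBuffer stored = toMetadataValue(text);
        System.out.println(fromMetadataValue(stored)); // prints "me specific tag"
    }
}
```

The value printed at the end is what you would pass to doc.add("your_specific_tag", value).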