I am trying to extract the metatags of HTML files and indexing them into solr with tika integration. I am not able to extract those metatags with Tika and not able to display in solr.
My HTML file is look like this.
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="product_id" content="11"/>
<meta name="assetid" content="10001"/>
<meta name="title" content="title of the article"/>
<meta name="type" content="0xyzb"/>
<meta name="category" content="article category"/>
<meta name="first" content="details of the article"/>
<h4>title of the article</h4>
<p class="link"><a href="#link">How cite the Article</a></p>
<p class="list">
<span class="listterm">Length: </span>13 to 15 feet<br>
<span class="listterm">Height to Top of Head: </span>up to 18 feet<br>
<span class="listterm">Weight: </span>1,200 to 4,300 pounds<br>
<span class="listterm">Diet: </span>leaves and branches of trees<br>
<span class="listterm">Number of Young: </span>1<br>
<span class="listterm">Home: </span>Sahara<br>
</p>
</p>
My data-config.xml file look like this
<dataConfig>
<dataSource name="bin" type="BinFileDataSource" />
<document>
<entity name="f" dataSource="null" rootEntity="false"
processor="FileListEntityProcessor"
baseDir="/path/to/html/files/"
fileName=".*html|xml" onError="skip"
recursive="false">
<field column="fileAbsolutePath" name="path" />
<field column="fileSize" name="size"/>
<field column="file" name="filename"/>
<entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor"
url="${f.fileAbsolutePath}" format="text" onError="skip">
<field column="product_id" name="product_id" meta="true"/>
<field column="assetid" name="assetid" meta="true"/>
<field column="title" name="title" meta="true"/>
<field column="type" name="type" meta="true"/>
<field column="first" name="first" meta="true"/>
<field column="category" name="category" meta="true"/>
</entity>
</entity>
</document>
</dataConfig>
In my schema.xml file I have added the following fields.
<field name="product_id" type="string" indexed="true" stored="true"/>
<field name="assetid" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true"/>
<field name="type" type="string" indexed="true" stored="true"/>
<field name="category" type="string" indexed="true" stored="true"/>
<field name="first" type="text_general" indexed="true" stored="true"/>
In my solrconfing.xml file I have added the following code.
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler" />
<lst name="defaults">
<str name="config">/path/to/data-config.xml</str>
</lst>
can anyone know how to extract those metatags from the HTML files and index them in solr and Tika? your help will be appreciated.
I don't think meta="true" means what you think it means. It usually refers to things that are about the file rather than the content. So, content-type, etc. Possibly http-equiv will get mapped as well.
Other than that, you need to extract actual content. You can do it by using format="xml" and then putting an inner entity with XPathEntityProcessor and mapping the path then. Except, even then, you are limited because stuck because AFAIK, DIH uses DefaultHtmlMapper which is extremely restrictive in what it let's through and skips most of the 'class' and 'id' attributes and even things like 'div'. You can read the list of allowed elements and attributes by yourself in the source code.
Frankly, your easier path is to have a SolrJ client and manage Tika yourself. Then you can set it to use IdentityHtmlMapper which does not muck about with HTML.
Which version of Solr you are using? If you are using Solr 4.0 or above then tika is embedded into it. Tika communicates with the the solr using the 'Solr-Cells' 'ExtractingRequestHandler' class that is configured in the solrconfig.xml as follows:
Now in solr by default as you can see in above configuration, any fields extracted from HTML document that is not declared in schema.xml is prefixed with 'ignored_' i.e. they are mapped to 'ignored_*' dynamic field inside schema.xml. The default schema.xml that reads as follows:
And following is how 'ignored' types are treated:
So, metadata extracted by tika is by default put in 'ignored' field by the Solr-Cell and thats why they are ignored for indexing and storing. Therefore, to index and store the metadatas you either change the "uprefix=attr_" or 'create specific fields or dynamic field' for your known metadatas and treat them as you want.
So, here is the corrected solrconfig.xml: