How can I index .html files in Solr?

The files I want to index are stored on the server under /path/to/files/ (I don't need to crawl). A sample HTML file:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="product_id" content="11"/>
<meta name="assetid" content="10001"/>
<meta name="title" content="title of the article"/>
<meta name="type" content="0xyzb"/>
<meta name="category" content="article category"/>
<meta name="first" content="details of the article"/>

<h4>title of the article</h4>
<p class="link"><a href="#link">How cite the Article</a></p>
<p class="list">
  <span class="listterm">Length: </span>13 to 15 feet<br>
  <span class="listterm">Height to Top of Head: </span>up to 18 feet<br>
  <span class="listterm">Weight: </span>1,200 to 4,300 pounds<br>
  <span class="listterm">Diet: </span>leaves and branches of trees<br>
  <span class="listterm">Number of Young: </span>1<br>
  <span class="listterm">Home: </span>Sahara<br>

</p>

I have added the request handler to the solrconfig.xml file:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">/path/to/data-config.xml</str>
  </lst>
</requestHandler>

My data-config.xml looks like this:

<dataConfig>
<dataSource type="FileDataSource" />
<document>
    <entity name="f" processor="FileListEntityProcessor" baseDir="/path/to html/files/" fileName=".*html" recursive="true" rootEntity="false" dataSource="null">
        <field column="plainText" name="text"/>
    </entity>
</document>
</dataConfig>

I have kept the default schema.xml file and added the following fields to it:

 <field name="product_id" type="string" indexed="true" stored="true"/>
 <field name="assetid" type="string" indexed="true" stored="true" required="true" />
 <field name="title" type="string" indexed="true" stored="true"/>
 <field name="type" type="string" indexed="true" stored="true"/>
 <field name="category" type="string" indexed="true" stored="true"/>
 <field name="first" type="text_general" indexed="true" stored="true"/>

 <uniqueKey>assetid</uniqueKey>

When I ran a full import after setting this up, it reported that all the HTML files were fetched. But when I search in Solr, I get no results. Does anyone have an idea what the cause could be?

My understanding is that all the files were fetched correctly but not indexed in Solr. Does anyone know how I can index the meta tags and the content of the HTML files in Solr?

Any reply will be appreciated.

Tags: solr full-text-indexing dataimporthandler data-import solr4

4 Answers

叛逆

#2 · 2020-07-10 08:03

Did you mean to have fileName="*.html" in your data-config.xml? You currently have fileName=".*html".
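One thing worth checking before changing it: the fileName attribute of FileListEntityProcessor is documented as a regular expression, not a shell glob, so the two spellings are not interchangeable. The following is a minimal sketch using plain java.util.regex to compare them. The class name, helper names, and sample file names are made up for illustration, and the sketch uses a full match, which may not be exactly how the handler applies the pattern.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class FileNamePatternCheck {

    /** Returns true if the whole file name matches the given regular expression. */
    public static boolean matches(String regex, String fileName) {
        return Pattern.compile(regex).matcher(fileName).matches();
    }

    /** Returns true if the string compiles as a Java regular expression. */
    public static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The pattern from the question is a valid regex and accepts .html files.
        System.out.println(matches(".*html", "giraffe.html"));    // true
        // It is looser than intended: anything ending in "html" is accepted.
        System.out.println(matches(".*html", "notes.xhtml"));     // true
        // Escaping the dot restricts it to a literal ".html" suffix.
        System.out.println(matches(".*\\.html", "notes.xhtml"));  // false
        // The glob-style spelling starts with a dangling '*' and does not compile.
        System.out.println(isValidRegex("*.html"));               // false
    }
}
```

If that reading is right, the existing value is not what stops the documents from being indexed, although `.*\.html` would be a stricter choice.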

I am pretty certain Solr won't know how to translate the meta fields in your HTML into index fields, though I haven't tried.

I have, however, written programs that read (X)HTML using XPath and create a formatted XML file to send to /update. At that point, you should be able to use the DataImportHandler to pick up the formatted XML file(s).
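As a rough illustration of that conversion step, here is a small sketch that pulls the name/content pairs out of the meta tags and writes them as a Solr XML update message (`<add><doc>...`). It swaps XPath for a simple regular expression so that it runs on the JDK alone. It assumes the tags are written exactly as in the question (name first, then content, double-quoted), and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaToSolrXml {

    // Assumption: attributes appear as name="..." content="..." in that order.
    // Tags such as <meta http-equiv=...> do not match and are skipped.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name=\"([^\"]+)\"\\s+content=\"([^\"]*)\"\\s*/?>",
        Pattern.CASE_INSENSITIVE);

    /** Collects meta name/content pairs in document order. */
    public static Map<String, String> extractMeta(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    /** Escapes the characters that are special in XML text and attributes. */
    public static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Builds a single-document Solr XML update message from the fields. */
    public static String toAddXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(escapeXml(e.getKey())).append("\">")
              .append(escapeXml(e.getValue())).append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        String html = "<meta name=\"assetid\" content=\"10001\"/>\n"
                    + "<meta name=\"title\" content=\"title of the article\"/>";
        System.out.println(toAddXml(extractMeta(html)));
    }
}
```

A real HTML parser is more robust than a regex for anything beyond tidy, machine-generated files like the sample, so treat this as a starting point only.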


爱情/是我丢掉的垃圾

#3 · 2020-07-10 08:05

Here is a full example converting HTML to text and extracting relevant metadata:

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNull;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.junit.Test;

import java.io.ByteArrayInputStream;

public class ConversionTest {

    @Test
    public void testHtmlToTextConversion() throws Exception {
        ByteArrayInputStream bais = new ByteArrayInputStream(("<html>\n" +
            "<head>\n" +
            "<title> \n" +
            " A Simple HTML Document\n" +
            "</title>\n" +
            "</head>\n" +
            "<body></div>\n" +
            "<p>This is a very simple HTML document</p>\n" +
            "<p>It only has two paragraphs</p>\n" +
            "</body>\n" +
            "</html>").getBytes());
        BodyContentHandler contenthandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        parser.parse(bais, contenthandler, metadata, new ParseContext());
        assertEquals("\nThis is a very simple HTML document\n" + 
            "\n" + 
            "It only has two paragraphs\n" + 
            "\n", contenthandler.toString().replace("\r", ""));
        assertEquals("A Simple HTML Document", metadata.get("title"));
        assertEquals("A Simple HTML Document", metadata.get("dc:title"));
        assertNull(metadata.get("title2"));
        assertEquals("org.apache.tika.parser.DefaultParser", metadata.getValues("X-Parsed-By")[0]);
        assertEquals("org.apache.tika.parser.html.HtmlParser", metadata.getValues("X-Parsed-By")[1]);
        assertEquals("ISO-8859-1", metadata.get("Content-Encoding"));
        assertEquals("text/html; charset=ISO-8859-1", metadata.get("Content-Type"));
    }
}


Rolldiameter

#4 · 2020-07-10 08:07

The easiest way is to use the post tool from the bin directory. It does the whole job automatically. Here is an example:

./post -c conf1 /path/to/files/*

More info is here


三岁会撩人

#5 · 2020-07-10 08:09

You can use the Solr Extracting Request Handler to feed Solr the HTML file and have it extract the file's contents, e.g. at link

Solr uses Apache Tika to extract the contents of the uploaded HTML file.
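For a rough idea of what such a request looks like, here is a sketch that builds an `/update/extract` URL and posts one file to it using only JDK classes. The host, port, core name, and file path are placeholders, and it assumes the handler is registered at `/update/extract` as in the example solrconfig.xml. The `literal.<field>` parameter sets a field value explicitly, and `commit=true` makes the document searchable right away.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExtractPost {

    /** Builds the extract URL, setting one field explicitly via literal.<field>. */
    public static String buildExtractUrl(String coreUrl, String field, String value) {
        try {
            return coreUrl + "/update/extract"
                + "?literal." + field + "=" + URLEncoder.encode(value, "UTF-8")
                + "&commit=true";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM.
            throw new IllegalStateException(e);
        }
    }

    /** Sends the raw file as the request body and returns the HTTP status code. */
    public static int postFile(String url, Path file) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/html");
        try (OutputStream out = conn.getOutputStream()) {
            Files.copy(file, out);
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder core URL and file path; adjust both to your installation.
        String url = buildExtractUrl("http://localhost:8983/solr/collection1", "assetid", "10001");
        System.out.println(url);
        System.out.println(postFile(url, Paths.get("/path/to/files/sample.html")));
    }
}
```

Whether the meta tags end up in the schema fields of the same name depends on how the handler's field mapping is configured, so it is worth checking the indexed document after the first upload.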

Nutch with Solr is a broader solution if you want to crawl websites and have them indexed.
The Nutch with Solr Tutorial will get you started.

