How to index text files using Apache Solr

Posted 2019-03-22 08:24

Question:

I want to index text files. After a lot of searching I came across Apache Tika. Some of the sites I read say that Apache Tika converts the text into XML format and then sends it to Solr, but that the conversion creates only one tag, for example ....... The text file I wish to index is a Tomcat localhost access log. The file is several GB in size, so I cannot store it as a single index entry. I want each line to have its own line ID ....... so that I can easily retrieve the matching line.

Can this be done in Apache Tika?

Answer 1:

Solr with Tika supports extracting data from multiple file formats.
The complete list of supported file formats can be found in the Apache Tika documentation.

You can provide any of the supported file formats as input. Tika autodetects the file format, extracts the text from the file, and passes it to Solr for indexing.

Edit:
Tika does not convert the text file to XML before sending it to Solr. Tika just extracts the metadata and the content of the file and populates fields in Solr according to the mapping you define.

You either have to feed the entire file to Solr as input, in which case it is indexed as a single document, or you have to read the file line by line and send each line to Solr as a separate document.
Solr and Tika will not do this splitting for you.
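
The line-by-line approach could look roughly like the sketch below. It is only an illustration under several assumptions that are not from the question: a core named `accesslog` on a local Solr, the field names `line_no_i` and `text_t` (which rely on dynamic fields being enabled in the schema), and the helper names themselves. It streams the file, gives every line an ID built from the file name and line number, and posts the documents to Solr's JSON update handler in batches so that a multi-GB file never has to fit in memory.

```python
import json
import urllib.request
from itertools import islice

# Assumed for illustration: a local Solr with a core named "accesslog".
# commitWithin asks Solr to commit within 10 s instead of on every batch.
SOLR_UPDATE_URL = "http://localhost:8983/solr/accesslog/update?commitWithin=10000"


def line_documents(lines, source):
    """Yield one Solr document per non-empty line, keyed by source and line number."""
    for line_no, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if not text:
            continue  # skip blank lines but keep the original numbering
        yield {
            "id": f"{source}-{line_no}",  # unique line ID
            "line_no_i": line_no,         # hypothetical field name
            "text_t": text,               # hypothetical field name
        }


def batched(docs, size):
    """Group an iterator of documents into lists of at most `size` items."""
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def post_batch(batch, url=SOLR_UPDATE_URL):
    """POST one batch of documents as a JSON array to the update handler."""
    request = urllib.request.Request(
        url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_file(path, batch_size=1000):
    """Stream a large text file into Solr, one document per line."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for batch in batched(line_documents(handle, path), batch_size):
            post_batch(batch)
```

Calling `index_file("localhost_access_log.txt")` (a placeholder file name) would then index each line separately, and a search hit returns the `id` and `line_no_i` of the matching line.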



Answer 2:

You may want to look at DataImportHandler to parse the file into lines or entries. It is a better match than running Tika on a file that already has internal structure.
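
To make "internal structure" concrete: an access log line is already a small record, so it can be split into fields before indexing. The sketch below is not DataImportHandler; it is plain Python that assumes the log uses the `common` pattern (`%h %l %u %t "%r" %s %b`). The function name, field names, and sample line are illustrative only.

```python
import re

# Matches the "common" access log pattern: %h %l %u %t "%r" %s %b
LINE_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_access_line(line):
    """Return the fields of one access log line, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A "-" in the size column means no response body was sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


# Made-up sample line, for illustration only.
sample = '127.0.0.1 - - [22/Mar/2019:08:24:00 +0000] "GET /index.html HTTP/1.1" 200 1043'
parsed = parse_access_line(sample)
```

Each parsed dictionary could then be indexed as its own document, which allows searching by status code or host rather than only by raw text.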