Are IFilters necessary to index the full text of documents?

I am moving along in my project and have come to a crossroads dealing with file content. I have successfully created a working index that has some classification fields, but I am now looking to apply keyword search to the file contents. My issue is that I am not sure whether passing Lucene a reader causes the API to index the entire file contents. I did some searching online and found suggestions that an IFilter would be needed. Is that true? It seems somewhat complicated. My code for indexing file contents is below and does not work (it fails if a reader is passed). Ideally, I would like to be able to process doc and docx files. Any help is much appreciated.

My code creating a reader

public void setFileText()
{
    var FD = new System.Windows.Forms.OpenFileDialog();
    StreamReader reader = null;

    // Open a reader only if the user actually picked a file.
    if (FD.ShowDialog() == System.Windows.Forms.DialogResult.OK)
    {
        string fileToOpen = FD.FileName;
        // Note: StreamReader decodes the raw bytes of the file as text.
        reader = new StreamReader(fileToOpen);
    }

    this.FileText = reader;
}

My code to add the document to the index

private static void _addToLuceneIndex(MATS_Doc Data, IndexWriter writer)
{
    // remove older index entry
    // Query searchQuery = new TermQuery(new Term("Id", Data.Id.ToString()));
    // writer.DeleteDocuments(searchQuery);

    // add new index entry
    Document doc = new Document();

    // add lucene fields mapped to db fields (stored, matched as exact values)
    doc.Add(new Field("Id", Data.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    if (!string.IsNullOrEmpty(Data.Title))
        doc.Add(new Field("Title", Data.Title, Field.Store.YES, Field.Index.NOT_ANALYZED));
    if (!string.IsNullOrEmpty(Data.Plant))
        doc.Add(new Field("Plant", Data.Plant, Field.Store.YES, Field.Index.NOT_ANALYZED));
    if (!string.IsNullOrEmpty(Data.Containment))
        doc.Add(new Field("Containment", Data.Containment, Field.Store.YES, Field.Index.NOT_ANALYZED));
    if (!string.IsNullOrEmpty(Data.Part))
        doc.Add(new Field("Part", Data.Part, Field.Store.YES, Field.Index.NOT_ANALYZED));
    if (!string.IsNullOrEmpty(Data.Operation))
        doc.Add(new Field("Operation", Data.Operation, Field.Store.YES, Field.Index.NOT_ANALYZED));
    if (!string.IsNullOrEmpty(Data.Geometry))
        doc.Add(new Field("Geometry", Data.Geometry, Field.Store.YES, Field.Index.NOT_ANALYZED));

    // file contents: a reader-valued field is tokenized and indexed, but not stored
    if (Data.FileText != null)
        doc.Add(new Field("Text", Data.FileText));

    // add entry to index
    writer.AddDocument(doc);
}

Tags: c# full-text-search lucene.net ifilter

3 Answers

劫难

#2 · 2019-08-05 11:17

It's actually very simple to use IFilters.

I suggest using Eclipse.IndexingService (in c#).

Then all you have to do (besides installing the IFilters if needed) is:

using (FilterReader filterReader = new FilterReader(path, Path.GetExtension(path)))
{
     filterReader.Init();
     string content = filterReader.ReadToEnd();
}

You can read more about IFilters here:

http://www.codeproject.com/Articles/31944/Implementing-a-TextReader-to-extract-various-files

http://www.codeproject.com/Articles/13391/Using-IFilter-in-C


beautiful°

#3 · 2019-08-05 11:20

Lucene by itself cannot process .doc and .docx files. Solr might be worth a look here, as Lucene itself is just a library for building search engines.
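To illustrate why a plain StreamReader is not enough, and what an extraction step does, here is a minimal sketch for the .docx case only. It relies on the fact that a .docx file is a zip archive whose main text lives in word/document.xml. The class name DocxText and the method Extract are hypothetical names made up for this example, not part of Lucene.Net or any IFilter library. It ignores headers, footers and footnotes, and text inside text boxes may appear twice.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of Lucene.Net or any IFilter library.
public static class DocxText
{
    // WordprocessingML main namespace used by word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the body text of a .docx stream, one line per paragraph.
    public static string Extract(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("Not a .docx: word/document.xml is missing.");

            using (Stream xml = entry.Open())
            {
                // Preserve whitespace so spaces between runs are not lost.
                XDocument body = XDocument.Load(xml, LoadOptions.PreserveWhitespace);

                // Each w:p is a paragraph; its text is the concatenation of its w:t runs.
                var paragraphs = body.Descendants(W + "p")
                    .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)));

                return string.Join(Environment.NewLine, paragraphs);
            }
        }
    }
}
```

The returned string could then be added to the question's document with something like new Field("Text", text, Field.Store.NO, Field.Index.ANALYZED). Legacy binary .doc files are not zip archives, so they would still need an IFilter or another converter.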


贪生不怕死

#4 · 2019-08-05 11:28

Another option that might be worth looking into is RavenDB, which uses Lucene.Net internally for its indexing engine. It looks like you are in a desktop app, so you should consider RavenDB's embedded mode.

You can then use my Indexed Attachments Bundle, which manages much of this for you. You simply upload a document as an attachment, and it takes care of extracting the text from it using IFilters. It builds an index over that text automatically. You can then do a full-text Lucene search on that index. If desired, you can even highlight the search terms found.

Documentation for the bundle is currently lacking, but you should be able to gather what you need from the unit tests.
