Are IFilters necessary to index full-text documents?

Posted 2019-08-05 11:03

I am moving along in my project and have come to a crossroads in dealing with file content. I have successfully created a working index that has some classification fields, but I am now looking to apply keyword search to the file contents. My issue is that I am not sure whether passing Lucene a reader results in the API indexing the entire file contents. I did some searching online and found suggestions that an IFilter would be needed. Is that true? It seems somewhat complicated. Anyway, my code for indexing file contents is below, and it does not work (if a reader is passed, it fails). Ideally, I would like to be able to process .doc and .docx files. Any help is much appreciated.

My code creating a reader

        public void setFileText()
        {
            // Let the user pick a file; FileText stays null if the dialog is cancelled.
            var FD = new System.Windows.Forms.OpenFileDialog();
            StreamReader reader = null;
            if (FD.ShowDialog() == System.Windows.Forms.DialogResult.OK)
            {
                string fileToOpen = FD.FileName;
                reader = new StreamReader(fileToOpen);
            }
            this.FileText = reader;
        }
}

My code to add the document to the index

        private static void _addToLuceneIndex(MATS_Doc Data, IndexWriter writer)
        {
            // remove older index entry
            // Query searchQuery = new TermQuery(new Term("Id", Data.Id.ToString()));
            // writer.DeleteDocuments(searchQuery);

            // add new index entry
            Document doc = new Document();

            // add lucene fields mapped to db fields

            doc.Add(new Field("Id", Data.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
            if (!string.IsNullOrEmpty(Data.Title))
                doc.Add(new Field("Title", Data.Title, Field.Store.YES, Field.Index.NOT_ANALYZED));
            if (!string.IsNullOrEmpty(Data.Plant))
                doc.Add(new Field("Plant", Data.Plant, Field.Store.YES, Field.Index.NOT_ANALYZED));
            if (!string.IsNullOrEmpty(Data.Containment))
                doc.Add(new Field("Containment", Data.Containment, Field.Store.YES, Field.Index.NOT_ANALYZED));
            if (!string.IsNullOrEmpty(Data.Part))
                doc.Add(new Field("Part", Data.Part, Field.Store.YES, Field.Index.NOT_ANALYZED));
            if (!string.IsNullOrEmpty(Data.Operation))
                doc.Add(new Field("Operation", Data.Operation, Field.Store.YES, Field.Index.NOT_ANALYZED));
            if (!string.IsNullOrEmpty(Data.Geometry))
                doc.Add(new Field("Geometry", Data.Geometry, Field.Store.YES, Field.Index.NOT_ANALYZED));
            if (Data.FileText != null)
                doc.Add(new Field("Text", Data.FileText));
            // add entry to index
            writer.AddDocument(doc);
        }

3 Answers
劫难
Reply #2 · 2019-08-05 11:17

It's actually very simple to use IFilters.

I suggest using Eclipse.IndexingService (in C#).

Then all you have to do (besides installing the IFilters if needed) is:

using (FilterReader filterReader = new FilterReader(path, Path.GetExtension(path)))
{
     filterReader.Init();
     string content = filterReader.ReadToEnd();
}

You can read more about IFilters here:

http://www.codeproject.com/Articles/31944/Implementing-a-TextReader-to-extract-various-files

http://www.codeproject.com/Articles/13391/Using-IFilter-in-C
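
Once the text has been extracted into a string, it still has to be handed to Lucene as an analyzed field. The following is a minimal sketch of that step, not the asker's or any library's official method. It assumes Lucene.Net 3.0.x (the `Field.Store` / `Field.Index` API used in the question) is referenced; the `ContentField` class and `AddText` method are hypothetical names, and the field name `"Text"` is taken from the question's code.

```csharp
using Lucene.Net.Documents;

// Hypothetical helper: names are illustrative, not part of Lucene.Net.
public static class ContentField
{
    // Adds already-extracted plain text to a document as a tokenized field
    // so it can be matched by keyword queries. The text is not stored,
    // which keeps the index smaller; use Field.Store.YES if the original
    // text must be retrievable (for example, for highlighting).
    public static void AddText(Document doc, string content)
    {
        if (!string.IsNullOrEmpty(content))
        {
            doc.Add(new Field("Text", content, Field.Store.NO, Field.Index.ANALYZED));
        }
    }
}
```

In the question's `_addToLuceneIndex`, a call such as `ContentField.AddText(doc, content)` would take the place of the reader-based `new Field("Text", Data.FileText)` line, with `content` being the string returned by `filterReader.ReadToEnd()`.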

beautiful°
Reply #3 · 2019-08-05 11:20

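Lucene by itself cannot parse .doc and .docx files; it only indexes the text you hand it. Solr might be worth a look here, since it can extract text from such files through its Apache Tika integration, whereas Lucene itself is just a library for building search engines.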
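
To illustrate what "extracting the text first" can look like without IFilters or Solr, here is a minimal sketch for .docx only. It relies on the fact that a .docx file is a ZIP package whose main body lives in `word/document.xml`; the `DocxText` class and `Extract` method are hypothetical names. It assumes .NET 4.5 or later with a reference to `System.IO.Compression`, does not handle the older binary .doc format, and may repeat text where paragraphs are nested (for example, inside text boxes).

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: names are illustrative only.
public static class DocxText
{
    // WordprocessingML main namespace used by elements in word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the paragraph text of a .docx stream, one paragraph per line.
    public static string Extract(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("Not a .docx package: word/document.xml is missing.");

            using (Stream part = entry.Open())
            {
                XDocument xml = XDocument.Load(part, LoadOptions.PreserveWhitespace);

                // Each w:p is a paragraph; its visible text is split across w:t runs.
                var paragraphs = xml.Descendants(W + "p")
                    .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)));

                return string.Join(Environment.NewLine, paragraphs);
            }
        }
    }
}
```

The returned string can then be indexed like any other text, for example with `using (var fs = File.OpenRead(path)) { string content = DocxText.Extract(fs); }`.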

贪生不怕死
Reply #4 · 2019-08-05 11:28

Another option that might be worth looking into is RavenDB, which uses Lucene.Net internally for its indexing engine. It looks like you are in a desktop app, so you should consider RavenDB's embedded mode.

You can then use my Indexed Attachments Bundle, which manages much of this for you. You simply upload a document as an attachment, and it takes care of extracting the text from it using IFilters. It builds an index over that text automatically. You can then do a full-text Lucene search on that index. If desired, you can even highlight the search terms found.

Documentation for the bundle is currently lacking, but you should be able to gather what you need from the unit tests.
