How might I index PDF files using Lucene.Net?

Posted 2019-03-11 17:56

I'm looking for sample code that demonstrates how to index PDF documents using Lucene.Net and C#. Google turned up a few examples, but none that I found helpful.

Tags: c# lucene.net implementation

2 answers

老娘就宠你

#2 · 2019-03-11 18:27

From my understanding, Lucene is limited to creating an index and searching that index. It's up to the application to handle opening files and extracting their contents for the index. So if you're looking to search PDF documents you'll want to use something like iTextSharp to open the file, pull out the contents, and pass it to Lucene for indexing. There are some good starting examples of using Lucene on the Dimecasts.net website.
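To illustrate that division of labor, here is a minimal sketch in which Lucene.Net only ever sees a plain string, however that string was obtained. It assumes the Lucene.Net 3.0.3 API; the class name, method name, and the field names "path" and "content" are hypothetical choices for this example, and an in-memory RAMDirectory is used only to keep the sketch self-contained.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class PlainTextIndexDemo
{
    // Indexes one already-extracted string in memory, then returns the stored
    // "path" value of every document that matches queryText.
    // All names here are illustrative, not part of Lucene.Net.
    public static List<string> IndexAndSearch(string path, string text, string queryText)
    {
        // Fully qualified to avoid a clash with System.Version.
        var version = Lucene.Net.Util.Version.LUCENE_30;
        var analyzer = new StandardAnalyzer(version);
        var matches = new List<string>();

        using (var directory = new RAMDirectory())
        {
            using (var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
            {
                var document = new Document();
                // Stored but not tokenized: lets a hit be mapped back to its file.
                document.Add(new Field("path", path, Field.Store.YES, Field.Index.NOT_ANALYZED));
                // Tokenized but not stored: this is the searchable body text.
                document.Add(new Field("content", text, Field.Store.NO, Field.Index.ANALYZED));
                writer.AddDocument(document);
                writer.Commit(); // make the document visible to searchers
            }

            using (var searcher = new IndexSearcher(directory, true)) // true = read-only
            {
                var query = new QueryParser(version, "content", analyzer).Parse(queryText);
                foreach (var hit in searcher.Search(query, 10).ScoreDocs)
                {
                    matches.Add(searcher.Doc(hit.Doc).Get("path"));
                }
            }
        }

        return matches;
    }
}
```

Calling PlainTextIndexDemo.IndexAndSearch("report.pdf", extractedText, "invoice") would return the path of the document if the extracted text contains that word. For a persistent index, FSDirectory.Open would replace the RAMDirectory.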


虎瘦雄心在

#3 · 2019-03-11 18:44

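using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// "document.pdf" is a placeholder path; any byte[] holding the .pdf works.
byte[] pdfBytes = File.ReadAllBytes("document.pdf");
StringBuilder stringBuilder = new StringBuilder();
PdfReader pdfReader = new PdfReader(pdfBytes);

// iTextSharp page numbers start at 1.
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    stringBuilder.Append(PdfTextExtractor.GetTextFromPage(pdfReader, page)).Append(' ');
}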

(using iTextSharp)

The rest isn't as succinctly illustrated.
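For readers who want a starting point for that remaining part, the following is a rough sketch, not the code from the product mentioned below. It assumes iTextSharp 5.x and the Lucene.Net 3.0.3 API; the class name, method names, folder parameters, and the field names "path" and "content" are all hypothetical. It adds one Lucene document per PDF file found in a folder.

```csharp
using System.IO;
using System.Text;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
// Aliases instead of namespace imports, to avoid type-name clashes between the two libraries.
using PdfReader = iTextSharp.text.pdf.PdfReader;
using PdfTextExtractor = iTextSharp.text.pdf.parser.PdfTextExtractor;

public static class PdfFolderIndexer
{
    // Indexes every *.pdf directly inside pdfFolder into an on-disk index at
    // indexFolder and returns how many files were added. Names are illustrative.
    public static int IndexFolder(string pdfFolder, string indexFolder)
    {
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        int indexed = 0;

        using (var directory = FSDirectory.Open(new DirectoryInfo(indexFolder)))
        using (var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            // System.IO.Directory is fully qualified because Lucene.Net.Store
            // also defines a type named Directory.
            foreach (string pdfPath in System.IO.Directory.GetFiles(pdfFolder, "*.pdf"))
            {
                var document = new Document();
                document.Add(new Field("path", pdfPath, Field.Store.YES, Field.Index.NOT_ANALYZED));
                document.Add(new Field("content", ExtractText(pdfPath), Field.Store.NO, Field.Index.ANALYZED));
                writer.AddDocument(document);
                indexed++;
            }
            writer.Commit();
        }

        return indexed;
    }

    // Concatenates the text of every page, separated by spaces.
    private static string ExtractText(string pdfPath)
    {
        var text = new StringBuilder();
        var reader = new PdfReader(pdfPath);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                text.Append(PdfTextExtractor.GetTextFromPage(reader, page)).Append(' ');
            }
        }
        finally
        {
            reader.Close(); // release the underlying file
        }
        return text.ToString();
    }
}
```

Storing the path but not the content keeps the index small: a search hit yields the file path, and the original PDF can be opened from there. Scanned PDFs that contain only images will produce empty text with this approach, since no OCR is involved.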

There is code in the product demo on my site that shows how to use the lucene.net code, but it is a little long to post here.

Here is the relevant code from my product: https://svn.arachnode.net/svn/arachnodenet/trunk/Plugins/CrawlActions/ManageLuceneDotNetIndexes.cs (Username/Password: Public)
