I am trying to extract text (as a string) from MS Word (.doc, .docx), Excel, and PowerPoint files using C#. Where can I find a free and simple .NET library to read MS Office documents? I tried NPOI, but I couldn't find a sample showing how to use it.
Using P/Invoke, you can use the IFilter interface (on Windows). The IFilters for many common file types are installed with Windows (you can browse them using this tool). You can simply ask the IFilter to return the text from the file. There are several sets of example code (here is one such example).
If you're looking for ASP.NET options, the interop won't work unless you install Office on the server. Even then, Microsoft advises against server-side automation of Office.
I used Spire.Doc and it worked beautifully (Spire.Doc download). It even read documents that were really .txt files saved with a .doc extension. There are free and paid versions. You can also get a trial license that removes some warnings from documents that you create, but I didn't create any, I only searched them, so the free version worked like a charm.
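A bit late to the party, but nevertheless: nowadays you don't need to download anything, because everything you need already ships with .NET. The Open XML formats (.docx, .xlsx, .pptx) are ZIP archives of XML files, so you can read them directly. Just make sure to add references to System.IO.Compression and System.IO.Compression.FileSystem.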
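A minimal sketch of that approach for .docx files, assuming the main text lives in the `word/document.xml` part (the standard location). The class and method names (`DocxText`, `Extract`) are made up for illustration, and this does not handle the old binary .doc format.

```csharp
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class DocxText
{
    // WordprocessingML main namespace used by word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the document text, one line per paragraph.
    public static string Extract(string path)
    {
        using (ZipArchive zip = ZipFile.OpenRead(path))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; is this a .docx file?");

            using (Stream stream = entry.Open())
            {
                XDocument xml = XDocument.Load(stream);
                var sb = new StringBuilder();

                // Each w:p is a paragraph; its text is split across w:t elements.
                foreach (XElement paragraph in xml.Descendants(W + "p"))
                {
                    sb.AppendLine(string.Concat(
                        paragraph.Descendants(W + "t").Select(t => t.Value)));
                }
                return sb.ToString();
            }
        }
    }
}
```

Usage would be something like `string text = DocxText.Extract(@"C:\docs\report.docx");` (the path is just an example). Headers, footers, and footnotes live in other parts of the archive and are not read here, and text inside text boxes may appear more than once because those paragraphs are nested.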
For Microsoft Word 2007 and Microsoft Word 2010 (.docx) files, you can use the Open XML SDK. This snippet of code opens a document and returns its contents as text. It is especially useful for anyone trying to use regular expressions to parse the contents of a Word document. To use this solution, you need to reference DocumentFormat.OpenXml.dll, which is part of the Open XML SDK.
See: http://msdn.microsoft.com/en-us/library/bb448854.aspx
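For reference, a short sketch of reading the body text with the Open XML SDK. It is not necessarily the same code as in the linked article; the class and method names (`OpenXmlText`, `Extract`) are placeholders. It assumes a reference to DocumentFormat.OpenXml.dll.

```csharp
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class OpenXmlText
{
    // Opens the .docx read-only and returns the text of the document body.
    public static string Extract(string path)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, false))
        {
            MainDocumentPart mainPart = doc.MainDocumentPart;
            if (mainPart == null || mainPart.Document == null)
                return string.Empty;

            Body body = mainPart.Document.Body;

            // InnerText concatenates all text nodes under the body;
            // it does not insert separators between paragraphs.
            return body == null ? string.Empty : body.InnerText;
        }
    }
}
```

Because `InnerText` runs paragraphs together, you may prefer to loop over `body.Elements<Paragraph>()` and join each paragraph's `InnerText` with newlines if your regular expressions depend on line boundaries.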
Let me slightly correct the answer given by KyleM. I added processing of two extra nodes that influence the result: one is responsible for horizontal tabulation ("\t"), the other for vertical tabulation ("\v"). Here is the code:
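A minimal sketch of this kind of node handling, written against the raw XML rather than the SDK. The mapping is an assumption on my part: `w:tab` inside a run becomes "\t", and `w:br` / `w:cr` become "\v". The names `DocxTextWithTabs` and `Extract` are hypothetical.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml.Linq;

public static class DocxTextWithTabs
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static string Extract(string path)
    {
        using (ZipArchive zip = ZipFile.OpenRead(path))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found.");

            using (Stream stream = entry.Open())
            {
                var sb = new StringBuilder();
                Append(XDocument.Load(stream).Root, sb);
                return sb.ToString();
            }
        }
    }

    // Walks the tree in document order and emits text for the nodes we care about.
    private static void Append(XElement element, StringBuilder sb)
    {
        foreach (XElement child in element.Elements())
        {
            if (child.Name == W + "t")
            {
                sb.Append(child.Value);
            }
            else if (child.Name == W + "tab" && element.Name == W + "r")
            {
                // Only tabs inside a run are real tab characters;
                // w:tab under w:tabs is a tab-stop definition.
                sb.Append('\t');
            }
            else if (child.Name == W + "br" || child.Name == W + "cr")
            {
                sb.Append('\v'); // assumed mapping for line breaks
            }
            else
            {
                Append(child, sb);
                if (child.Name == W + "p")
                    sb.AppendLine(); // end of paragraph
            }
        }
    }
}
```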
Use the Microsoft Office Interop. It's free and slick. Here is how I pulled all the words from a doc.
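Roughly along these lines (a sketch, not necessarily the exact code I used). It assumes Word is installed on the machine, a reference to Microsoft.Office.Interop.Word, and C# 4 or later so the `ref` arguments can be omitted. `InteropWords` and `ReadWords` are just illustrative names.

```csharp
using System.Collections.Generic;
using Word = Microsoft.Office.Interop.Word;

public static class InteropWords
{
    // Opens the document in a hidden Word instance and collects every word.
    // Pass a full path; Word resolves relative paths against its own folder.
    public static List<string> ReadWords(string fullPath)
    {
        var words = new List<string>();
        var app = new Word.Application();
        try
        {
            Word.Document doc = app.Documents.Open(FileName: fullPath, ReadOnly: true);

            // Each item in doc.Words is a Range; its Text usually
            // includes the trailing space after the word.
            foreach (Word.Range word in doc.Words)
            {
                words.Add(word.Text);
            }

            // Cast to the underlying interface to avoid the
            // method/event ambiguity warning on Close and Quit.
            ((Word._Document)doc).Close(SaveChanges: false);
        }
        finally
        {
            ((Word._Application)app).Quit(SaveChanges: false);
        }
        return words;
    }
}
```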
Then do whatever you want with the words.