How to parse text from MS Word document to string

Posted 2019-01-25 21:50

I am trying to find a way to parse a Word document's text into a string in my project. I have more than 600 Word (.doc) files, and for each one I need to get the text content (with new lines and tabs, if possible) and assign it to a string.

I've been reading about the Open XML SDK, but it looks quite complicated for something that seems so simple.

2 answers
叛逆
Reply #2 · 2019-01-25 22:28

You could take a look at NPOI:

This project is the .NET version of the POI Java project at http://poi.apache.org/. POI is an open source project which can help you read/write xls, doc, ppt files. It has a wide range of applications.

Take a look at this previous SO thread for more information.

Viruses.
Reply #3 · 2019-01-25 22:34

The Open XML SDK only handles the 2007 and newer formats (.docx), and it is not trivial to use.

If performance is not an issue, you could use Word Automation and have Word do this for you. It will look something like this:

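// Requires a reference to Microsoft.Office.Interop.Word.
using System.Runtime.InteropServices;
using Microsoft.Office.Interop.Word;

var app = new Application();
var doc = app.Documents.Open(documentLocation, ReadOnly: true);

// Word returns paragraph breaks as '\r' and tabs as '\t'.
string rangeText = doc.Range().Text;

// Nothing was modified, so close without saving, then shut Word down.
doc.Close(SaveChanges: false);
app.Quit();

Marshal.ReleaseComObject(doc);
Marshal.ReleaseComObject(app);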

Take a look at http://www.codeproject.com/Articles/18703/Word-2007-Automation or http://www.codeproject.com/Articles/21247/Word-Automation for more complete examples and instructions. Note that this may become a bit trickier if your documents are more complex (footnotes, text boxes, tables...).
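Since the question mentions more than 600 files, it is probably worth starting Word once and reusing it for every document rather than launching it per file. The following is only a rough sketch of that idea: the class `WordTextReader` and the method `ReadAll` are hypothetical names made up for illustration, and it assumes a project reference to Microsoft.Office.Interop.Word and an installed copy of Word.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;
using Word = Microsoft.Office.Interop.Word;

static class WordTextReader
{
    // Hypothetical helper: maps each .doc path in a folder to its text.
    public static Dictionary<string, string> ReadAll(string folder)
    {
        var texts = new Dictionary<string, string>();
        var app = new Word.Application();   // one Word instance for the whole batch
        try
        {
            // Note: on .NET Framework a three-character pattern such as
            // "*.doc" may also match longer extensions like ".docx".
            foreach (string path in Directory.GetFiles(folder, "*.doc"))
            {
                Word.Document doc = app.Documents.Open(path, ReadOnly: true);
                try
                {
                    texts[path] = doc.Range().Text;
                }
                finally
                {
                    // Close without saving, even if reading the text failed.
                    doc.Close(SaveChanges: false);
                    Marshal.ReleaseComObject(doc);
                }
            }
        }
        finally
        {
            // Always shut Word down so no hidden WINWORD.EXE process is left behind.
            app.Quit();
            Marshal.ReleaseComObject(app);
        }
        return texts;
    }
}
```

The `try`/`finally` blocks are there so that a single corrupt document does not leave Word running in the background; whether to skip or abort on a failing file is a design choice this sketch leaves open.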

Another option is to have Word save the document as a text file and then read that file. Take a look at this: http://msdn.microsoft.com/en-us/library/microsoft.office.tools.word.document.saveas(v=vs.80).aspx
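A minimal sketch of the save-as-text route might look like the code below. It uses the interop `Document.SaveAs2` method (Word 2010 and later; older versions expose `SaveAs` instead) rather than the VSTO method in the link. The names `WordToTextFile` and `ConvertAndRead` are hypothetical, and the sketch assumes that `wdFormatUnicodeText` writes the file as UTF-16, which is why the file is read back with `Encoding.Unicode`.

```csharp
using System.IO;
using System.Runtime.InteropServices;
using System.Text;
using Word = Microsoft.Office.Interop.Word;

static class WordToTextFile
{
    // Hypothetical helper: saves one document as .txt next to the original
    // and returns the contents of that text file.
    public static string ConvertAndRead(Word.Application app, string docPath)
    {
        string txtPath = Path.ChangeExtension(docPath, ".txt");
        Word.Document doc = app.Documents.Open(docPath, ReadOnly: true);
        try
        {
            // Assumption: the Unicode text format is written as UTF-16.
            doc.SaveAs2(txtPath, Word.WdSaveFormat.wdFormatUnicodeText);
        }
        finally
        {
            doc.Close(SaveChanges: false);
            Marshal.ReleaseComObject(doc);
        }
        return File.ReadAllText(txtPath, Encoding.Unicode);
    }
}
```

The caller is expected to create the `Word.Application` instance and to call `Quit()` on it once all documents are converted.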
