Microsoft Word text parser in C

Posted 2019-04-03 00:02

I would like to know how to parse Microsoft Word documents (.doc and .docx) and extract their text content. The code has to be plain C, compiled with gcc.

Are there any libraries that already do this job?

Extension: can I use the same procedure to extract text from Microsoft PowerPoint files as well?

4 answers
仙女界的扛把子
#2 · 2019-04-03 00:14

Microsoft Word documents are an enormous beast - you definitely don't want to be writing this code yourself. Look into using an existing free Word library such as antiword or wvWare.
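As a rough illustration of the "use antiword" route: antiword is usually run as a command-line program that prints the text of a .doc file to standard output, so one low-effort option from C is to run it through popen() and collect what it prints. This is only a sketch under assumptions: it needs a POSIX system with antiword on the PATH, the helper names capture_command_output and antiword_extract are made up for this example, and antiword reads the legacy .doc format only, not .docx.

```c
#define _POSIX_C_SOURCE 200809L   /* for popen/pclose under strict -std= modes */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: run a shell command and return everything it wrote to
   stdout as a heap-allocated, NUL-terminated string. The caller frees it.
   Returns NULL if the command cannot be started, memory runs out, or the
   command exits with a non-zero status. */
char *capture_command_output(const char *cmd)
{
    FILE *pipe = popen(cmd, "r");
    if (pipe == NULL)
        return NULL;

    size_t cap = 4096, len = 0, n;
    char *out = malloc(cap);
    if (out == NULL) {
        pclose(pipe);
        return NULL;
    }

    /* Always leave one byte free for the terminating NUL. */
    while ((n = fread(out + len, 1, cap - len - 1, pipe)) > 0) {
        len += n;
        if (cap - len <= 1) {                 /* buffer full: double it */
            char *grown = realloc(out, cap * 2);
            if (grown == NULL) {
                free(out);
                pclose(pipe);
                return NULL;
            }
            out = grown;
            cap *= 2;
        }
    }
    out[len] = '\0';

    if (pclose(pipe) != 0) {                  /* command failed */
        free(out);
        return NULL;
    }
    return out;
}

/* Hypothetical wrapper: extract the text of a legacy .doc file via antiword.
   Assumption: antiword is installed and prints plain text to stdout. */
char *antiword_extract(const char *doc_path)
{
    char cmd[4096];

    /* Keep shell quoting trivial: refuse paths containing a single quote. */
    if (strchr(doc_path, '\'') != NULL)
        return NULL;

    int n = snprintf(cmd, sizeof cmd, "antiword '%s'", doc_path);
    if (n < 0 || (size_t)n >= sizeof cmd)
        return NULL;

    return capture_command_output(cmd);
}
```

A caller would do something like char *text = antiword_extract("report.doc"); check the result for NULL, use the text, then free() it. Shelling out is chosen here because it avoids linking against anything; the trade-off is a process spawn per document and a dependency on the external tool being installed.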

小情绪 Triste *
#3 · 2019-04-03 00:16

If you're willing to go through the effort of using a COM interface in C, you can use the IFilter interface built into every version of Windows since Windows 2000. You can use it to extract text from any Office document (Word, Excel, etc.), a PDF file, or any other file type that has an IFilter installed.

I wrote a blog post about it a few years back. It's all C++, but you can use COM objects from C.

戒情不戒烟
#4 · 2019-04-03 00:25

I don't know about libraries that exist, but the format specifications are available from Microsoft for free and under a promise not to sue you for using them.
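If you do go down the specification route, a sensible first step is telling the two container formats apart, since they need completely different parsers. Legacy .doc (and .ppt) files are stored in the OLE Compound File Binary container, which starts with the 8-byte signature D0 CF 11 E0 A1 B1 1A E1; .docx (and .pptx) files are ZIP packages of XML parts, and a ZIP archive normally starts with "PK" followed by 03 04. A minimal sketch of that check follows; the enum and function names are invented for this example, and the signature only identifies the container, not that the content is really a Word document.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names, for illustration only. */
enum office_container {
    CONTAINER_UNKNOWN,
    CONTAINER_OLE_CFB,   /* legacy binary formats: .doc, .ppt, .xls */
    CONTAINER_ZIP        /* Open XML formats: .docx, .pptx, .xlsx   */
};

/* Inspect the first bytes of a file and report which container it uses.
   buf points at the start of the file, len is how many bytes were read. */
enum office_container detect_office_container(const unsigned char *buf, size_t len)
{
    static const unsigned char cfb_magic[8] =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };
    static const unsigned char zip_magic[4] =
        { 'P', 'K', 0x03, 0x04 };

    if (len >= sizeof cfb_magic && memcmp(buf, cfb_magic, sizeof cfb_magic) == 0)
        return CONTAINER_OLE_CFB;
    if (len >= sizeof zip_magic && memcmp(buf, zip_magic, sizeof zip_magic) == 0)
        return CONTAINER_ZIP;
    return CONTAINER_UNKNOWN;
}
```

In practice you would fread() the first 8 bytes of the file into a buffer and pass them in. For the ZIP case the body text of a .docx lives in the word/document.xml part, so the remaining work is unzipping plus XML parsing; for the Compound File case you still have to walk the container and then decode the Word binary stream, which is the hard part the other answers warn about.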

Ridiculous、
#5 · 2019-04-03 00:32

On Windows, let Word do the job and talk to it through its COM object. On Linux, the job has already been done in antiword. Or you can automate OpenOffice.org on any platform with the UNO object model.
