I would like to know how to parse Microsoft Word (.doc and .docx) documents and extract their text content. The programming language should be plain C (compiled with gcc).
Are there any libraries that already do this job?
Extension: can I use the same procedure to extract text from Microsoft PowerPoint files as well?
Microsoft Word documents are an enormous beast - you definitely don't want to be writing this code yourself. Look into using an existing free Word library such as antiword or wvWare.
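If linking against a library is more than you need, one low-effort route from plain C is to run the antiword command-line tool and read its output. Here is a minimal sketch, assuming a POSIX system with antiword installed and on the PATH. The helper names shell_quote and doc_to_text are made up for illustration. As far as I know, antiword only understands the older binary .doc format, not .docx.

```c
#define _POSIX_C_SOURCE 200809L  /* popen/pclose are POSIX, not ISO C */
#include <stdio.h>
#include <string.h>

/* Wrap `in` in single quotes for /bin/sh, escaping embedded single quotes.
   Returns 0 on success, -1 if `out` is too small. */
int shell_quote(const char *in, char *out, size_t outsz)
{
    size_t n = 0;

    if (outsz < 3) return -1;            /* need room for '' plus NUL */
    out[n++] = '\'';
    for (; *in; in++) {
        if (*in == '\'') {
            /* close the quote, emit an escaped quote, reopen: '\'' */
            if (n + 6 > outsz) return -1;
            memcpy(out + n, "'\\''", 4);
            n += 4;
        } else {
            if (n + 3 > outsz) return -1;
            out[n++] = *in;
        }
    }
    out[n++] = '\'';
    out[n] = '\0';
    return 0;
}

/* Run `antiword <path>` and copy whatever it prints to `out`.
   Returns the status from pclose(), or -1 if the command could not be run. */
int doc_to_text(const char *path, FILE *out)
{
    char quoted[4096], cmd[4200], buf[4096];
    size_t got;
    FILE *p;

    if (shell_quote(path, quoted, sizeof quoted) != 0) return -1;
    snprintf(cmd, sizeof cmd, "antiword %s", quoted);

    p = popen(cmd, "r");
    if (!p) return -1;
    while ((got = fread(buf, 1, sizeof buf, p)) > 0)
        fwrite(buf, 1, got, out);
    return pclose(p);
}
```

Calling doc_to_text("report.doc", stdout) from main would dump the text to the terminal. The quoting step matters because the path is passed through the shell; without it, a file name containing spaces or quotes would break the command.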
If you're willing to go through the effort of using a COM interface from C, you can use the IFilter interface built into every version of Windows since Windows 2000. You can use it to extract text from any Office document (Word, Excel, etc.), a PDF file, or any other file type that has an IFilter installed.
I wrote a blog post about it a few years back. It's all C++, but you can use COM objects from C.
I don't know of any existing libraries, but the format specifications are available from Microsoft for free, under a promise not to sue you for using them.
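To give an idea of what working from the specification looks like for .docx: the file is a ZIP archive, and the main body lives in word/document.xml, where visible text sits inside w:t elements and each paragraph is a w:p element. The sketch below is a rough illustration in plain C, not a real XML parser. It assumes you have already pulled document.xml out of the archive into memory (for example with `unzip -p file.docx word/document.xml`, or a ZIP library), and the function names are made up for this example. It ignores tabs, line breaks, tables, headers, and footnotes.

```c
#include <stdio.h>
#include <string.h>

/* Decode one of the five predefined XML entities at `s`.
   On a match, stores the character in *ch and returns the entity length;
   otherwise returns 0. */
static size_t decode_entity(const char *s, char *ch)
{
    static const struct { const char *name; char ch; } ents[] = {
        { "&amp;", '&' }, { "&lt;", '<' }, { "&gt;", '>' },
        { "&quot;", '"' }, { "&apos;", '\'' }
    };
    size_t i;

    for (i = 0; i < sizeof ents / sizeof ents[0]; i++) {
        size_t len = strlen(ents[i].name);
        if (strncmp(s, ents[i].name, len) == 0) {
            *ch = ents[i].ch;
            return len;
        }
    }
    return 0;
}

/* Copy the text of every <w:t> run in `xml` into `out`, adding '\n' at each
   </w:p>. Output is truncated to fit and always NUL-terminated.
   Returns the number of bytes written, excluding the NUL. */
size_t docx_xml_to_text(const char *xml, char *out, size_t outsz)
{
    size_t n = 0;
    const char *p = xml;

    if (outsz == 0) return 0;
    while (*p && n + 1 < outsz) {
        if (strncmp(p, "</w:p>", 6) == 0) {
            out[n++] = '\n';                 /* end of paragraph */
            p += 6;
        } else if (strncmp(p, "<w:t", 4) == 0 &&
                   (p[4] == '>' || p[4] == ' ')) {
            /* p[4] check avoids matching <w:tab/>, <w:tbl>, <w:tc>, ... */
            const char *end;
            p = strchr(p, '>');              /* skip attributes such as xml:space */
            if (!p) break;
            p++;
            end = strstr(p, "</w:t>");
            if (!end) break;
            while (p < end && n + 1 < outsz) {
                char ch;
                size_t len = decode_entity(p, &ch);
                if (len == 0) { ch = *p; len = 1; }
                out[n++] = ch;
                p += len;
            }
            p = end + 6;
        } else {
            p++;
        }
    }
    out[n] = '\0';
    return n;
}
```

The text comes out as UTF-8, assuming document.xml uses that encoding, which is the usual case. The old binary .doc format has no such shortcut, which is a good reason to lean on an existing tool for it.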
On Windows, let Word do the job and drive it through its COM object. On Linux, antiword already does the job. Alternatively, you can automate OpenOffice.org on any platform with the UNO object model.