The title may be a little confusing. The simplest approach is to decide by the file extension, like this:
// "is" is the InputStream of the file being read
if (filePath.endsWith(".doc")) {
    // Legacy binary Word format
    WordExtractor ex = new WordExtractor(is);
    text = ex.getText();
    ex.close();
} else if (filePath.endsWith(".docx")) {
    // XML-based (zip) Word format
    XWPFDocument doc = new XWPFDocument(is);
    XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
    text = extractor.getText();
    extractor.close();
}
This works in most cases. However, I have found that certain files whose extension is doc are essentially docx files: if you open one with WinRAR, you will find XML files inside. As is well known, a docx file is a zip file consisting of XML files.
I believe this problem cannot be rare, but I have not found any information about it. Clearly, relying on the extension to decide how to read a doc or docx file is not appropriate.
In my case, I have to read a lot of files, including doc or docx files inside compressed archives (zip, 7z, or even rar). Hence, I have to read the content from an InputStream instead of a File or something else. So the ZipInputStream approach from "How to know whether a file is .docx or .doc format from Apache POI" is not suitable for my case.
What is the best way to determine whether a file is doc or docx? I want a solution that reads the content of a file which may be either format, not one that merely tells me which format it is. Apparently, ZipInputStream is not a good method for my case, and I believe it is not an appropriate method for others either. Why should I have to decide whether the file is doc or docx by catching an exception?
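To make the requirement concrete, here is a minimal sketch of the kind of stream-based check I have in mind, using only the JDK. It peeks at the leading bytes and then rewinds the stream: binary doc files are OLE2 compound documents and start with the signature D0 CF 11 E0 A1 B1 1A E1, while docx files are zip archives and normally start with PK\x03\x04. The class and method names (WordFormatSniffer, markable, sniff) are hypothetical and not part of Apache POI. Note also that this only identifies the container: other OLE2 files (xls, ppt) and other zip files would match as well.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Apache POI: guesses the container
// format of a stream from its first bytes, without consuming them.
final class WordFormatSniffer {

    enum Container { OLE2, ZIP, UNKNOWN }

    // Signature of an OLE2 compound document (used by binary .doc files)
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature of a zip archive (used by .docx files)
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    // Wraps the stream only if it cannot already be marked and reset.
    static InputStream markable(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    // Reads up to 8 bytes, rewinds, and reports which signature matched.
    static Container sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] head = new byte[OLE2_MAGIC.length];
        in.mark(head.length);
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // stream ended before 8 bytes were available
            }
            n += r;
        }
        in.reset(); // leave the stream positioned at its first byte
        if (startsWith(head, n, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(head, n, ZIP_MAGIC)) {
            return Container.ZIP;
        }
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int length, byte[] magic) {
        if (length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With something like this, the branch above would test the result of sniff(markable(is)) instead of the extension, and pass the same wrapped stream to WordExtractor for OLE2 or to XWPFDocument for ZIP. As far as I can tell, recent Apache POI versions offer a comparable check through FileMagic.prepareToCheckMagic and FileMagic.valueOf, but I have not confirmed that it fits the archive-entry case.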