The title may be a little confusing. The simplest approach is to decide by the file extension, like this:
    // `is` is the InputStream of the file to read
    if (filePath.endsWith("doc")) {
        WordExtractor ex = new WordExtractor(is);
        text = ex.getText();
        ex.close();
    } else if (filePath.endsWith("docx")) {
        XWPFDocument doc = new XWPFDocument(is);
        XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
        text = extractor.getText();
        extractor.close();
    }
This works in most cases. But I have found that some files with the extension doc are actually docx files: if you open one with WinRAR, you will find xml files inside. As is well known, a docx file is a zip archive consisting of xml files.
I believe this problem is not rare, but I have not found any information about it. Clearly, relying on the extension to decide how to read a doc or docx file is not appropriate.
In my case, I have to read a lot of files, including doc or docx files inside compressed archives (zip, 7z, or even rar). Hence, I have to read the content from an InputStream instead of a File or something else. So the approach in "how to know whether a file is .docx or .doc format from Apache POI" is not suitable for my case with ZipInputStream.
What is the best way to tell whether a file is doc or docx? I want a solution that reads the content from a file which may be either format, not one that merely reports which format it is. Apparently, ZipInputStream is not a good method for my case, and I believe it is not an appropriate method for others either. Why should I have to find out whether the file is doc or docx by catching an exception?
Using the current stable Apache POI version, 3.17, you can use FileMagic. Internally, of course, this also has to look into the file. Example:
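To make "looking into the file" concrete first, here is a minimal sketch in plain Java with no POI dependency. It is not POI's implementation; the class name `WordFormatSniffer`, the `Kind` enum and both methods are hypothetical names chosen for illustration. It peeks at the first bytes of a stream and compares them with the OLE2 signature (binary doc) and the ZIP local-header signature (docx), then rewinds so the stream can still be handed to an extractor. It assumes the stream can be wrapped to support mark/reset, which also covers an entry being read from a ZipInputStream.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

public class WordFormatSniffer {

    /** Container kinds this sketch can tell apart. */
    public enum Kind { OLE2, ZIP, UNKNOWN }

    // Signature of an OLE2 compound document (binary .doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature of a ZIP archive ("PK" 0x03 0x04).
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    /** Wraps the stream only if it cannot already mark/reset. */
    public static InputStream markable(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Reads up to 8 bytes, classifies them, and rewinds the stream. */
    public static Kind sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream with markable() first");
        }
        byte[] head = new byte[OLE2_MAGIC.length];
        int total = 0;
        in.mark(head.length);
        try {
            // A single read() may return fewer bytes than requested.
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    break; // end of stream before 8 bytes were available
                }
                total += n;
            }
        } finally {
            in.reset(); // leave the stream positioned at its start
        }
        if (startsWith(head, total, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        if (startsWith(head, total, ZIP_MAGIC)) {
            return Kind.ZIP; // a ZIP container; not necessarily a docx
        }
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int length, byte[] prefix) {
        if (length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would wrap the stream once with `markable(...)`, call `sniff(...)` on the wrapped stream, and then pass that same wrapped stream to WordExtractor or XWPFDocument depending on the result. A ZIP signature only says the file is a ZIP container, so a file that is not really a docx will still fail later when it is opened. With POI itself, FileMagic does this kind of check for you: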