How to know whether a file is in .docx or .doc format

Posted 2019-07-13 06:36

Question:

I know this can be done by file extension or by MIME type. Is there any other way to determine whether a file is .docx or .doc?

Answer 1:

If it is just a matter of deciding between .doc and .docx for a collection of files that are known to be one or the other but are not marked with the right extension, you can use the fact that a .docx file is a zipped collection of files. Something along the following lines might help:

boolean isZip = new ZipInputStream( fileStream ).getNextEntry() != null;

where fileStream is the file or other input stream you wish to evaluate. You could further evaluate a zipped file by looking for key .docx entries. A good starting reference is Word Document (DOCX). Likewise, if you know it is just a binary file, you can test for Word's File Information Block (see Word (.doc) Binary File Format).
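
As a rough sketch of "looking for key .docx entries", the check below walks the ZIP entries and looks for word/document.xml. It uses only the JDK. The class and method names (DocxEntryCheck, looksLikeDocx) are made up for illustration, and the entry name is an assumption based on where Word normally stores the main document part; a package could in principle place it elsewhere, and related formats such as .docm contain the same entry.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class DocxEntryCheck {

    // Assumed location of the main document part in a typical .docx package.
    private static final String MAIN_PART = "word/document.xml";

    /** Returns true if the stream is a ZIP archive that contains word/document.xml. */
    static boolean looksLikeDocx(InputStream in) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // getNextEntry() returns null at the end of the archive,
            // or immediately if the data does not start with a ZIP entry.
            while ((entry = zip.getNextEntry()) != null) {
                if (MAIN_PART.equals(entry.getName())) {
                    return true;
                }
            }
            return false;
        }
    }
}
```

A call such as DocxEntryCheck.looksLikeDocx(new FileInputStream(path)) consumes and closes the stream, so open a fresh stream if the file has to be parsed afterwards.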



Answer 2:

You could use Apache Tika for content detection. Be aware, though, that it is a large framework (many required dependencies) for such a small task.



Answer 3:

There is a way, though it is not straightforward. With Apache POI, you can work it out.

Try to read a .docx file using the HWPFDocument class. It will give you the following error:

org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)

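import java.io.FileInputStream;
import java.io.IOException;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.OfficeXmlFileException;

String filePath = "C:\\XXXX\\XXXX.docx";
try (FileInputStream inStream = new FileInputStream(filePath)) {
    // HWPFDocument only understands the OLE2-based binary .doc format
    HWPFDocument doc = new HWPFDocument(inStream);
    WordExtractor wordExtractor = new WordExtractor(doc);
    System.out.println("Getting words: " + wordExtractor.getText());
} catch (OfficeXmlFileException e) {
    // Thrown when the data is Office 2007+ XML, such as .docx
    System.out.println("It's not a .doc format");
} catch (IOException e) {
    // The file could not be read at all; this says nothing about its format
    System.out.println("Could not read the file: " + e.getMessage());
}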

A .docx file can be read using the XWPFDocument class.



Answer 4:

Why don't you use Apache Tika?

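import java.io.File;
import org.apache.tika.Tika;

File file = new File("File Here");
Tika tika = new Tika();
// detect(File) can throw IOException; it returns the detected MIME type as a string
String filetype = tika.detect(file);
System.out.println(filetype);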


Answer 5:

Assuming you're using Apache POI, you have a few options.

One is to grab the first few bytes of the file and ask POIFSFileSystem via the hasPOIFSHeader(byte[]) method. If you have a stream that supports mark/reset, you can instead use POIFSFileSystem.hasPOIFSHeader(InputStream). If those return true, try to open the file as a .doc with HWPF; otherwise, try it as a .docx with XWPF.

Otherwise, if you prefer a try/catch approach, try to open the file with POIFSFileSystem and catch OfficeXmlFileException. If it opens fine, it's a .doc; if you get the exception, it's a .docx.

If you look at the source code for WorkbookFactory, you'll see the first pattern in use; you can copy a similar set of logic from that.
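
To illustrate what the header check in the first option amounts to, here is a rough JDK-only sketch that peeks at the first 8 bytes of a mark/reset-capable stream and rewinds it. This is not POI's implementation; WordMagicSniffer, Kind and sniff are hypothetical names. It only identifies the container: OLE2 is also used by .xls and .ppt, and ZIP by .xlsx and many other formats, so a match narrows the choice between .doc and .docx rather than proving either.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class WordMagicSniffer {

    enum Kind { OLE2, ZIP, UNKNOWN }

    // Signature at the start of an OLE2 compound file (container of binary .doc).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature "PK\3\4" at the start of a non-empty ZIP (container of .docx).
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 3, 4 };

    /** Peeks at up to 8 bytes and resets the stream to where it was. */
    static Kind sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] header = new byte[8];
        in.mark(header.length);
        int total = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF
        while (total < header.length) {
            int r = in.read(header, total, header.length - total);
            if (r < 0) {
                break;
            }
            total += r;
        }
        in.reset();

        if (total == header.length && Arrays.equals(header, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        if (total >= ZIP_MAGIC.length
                && Arrays.equals(Arrays.copyOf(header, ZIP_MAGIC.length), ZIP_MAGIC)) {
            return Kind.ZIP;
        }
        return Kind.UNKNOWN;
    }
}
```

A FileInputStream does not support mark/reset, so wrap it in a BufferedInputStream before calling sniff. Because the stream is rewound, the same stream can then be handed to HWPF or XWPF depending on the result.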