I receive files from queues in Java. They may be in any of the following formats:
- docx
- pdf
- doc
- xls
- xlsx
- txt
- rtf
After reading their extensions, I want to validate that they really are files of these types.
For example, suppose I receive a file with the extension .xls. I then want to check whether it is actually an .xls file, or whether someone uploaded a file in some other format after changing its extension.
EDIT: I'd like to determine the file's MIME type by inspecting its content, not its extension. How can this be done?
I don't think this is a problem you should be solving. Any solution would be brittle and based on your current understanding of what constitutes a valid file of a particular type.
For example, take an XLS file. Do you know for sure what Excel accepts when opening such a file? Can you be sure you'll keep abreast of changes in future releases that might support a different encoding style?
Ask yourself: what's the worst that could happen if the user uploads a file of the wrong type? Perhaps you'll pass the file to the application that handles that file extension and get an error. Not a problem, just pass that error on to the user!
Without using external libraries:
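You can get the file's MIME type using javax.activation.MimetypesFileTypeMap (bundled with the JDK up to Java 10; from Java 11 onward it needs a separate dependency):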
File f = new File(...);
System.out.println(new MimetypesFileTypeMap().getContentType(f));
You can get a similar result with:
URLConnection.guessContentTypeFromName
Both these solutions, according to the documentation, look only at the extension.
A better option: URLConnection.guessContentTypeFromStream
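File f = new File(...);
// guessContentTypeFromStream needs mark/reset support, which a bare FileInputStream lacks (it would return null)
InputStream in = new BufferedInputStream(new FileInputStream(f));
System.out.println(URLConnection.guessContentTypeFromStream(in));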
This tries to guess the type from the first bytes of the file. Be warned that it is only a guess: I found it works in most cases, but it fails to detect some obvious types.
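For the formats listed in the question, checking the leading bytes by hand may be more reliable than the built-in guess. The following is only a minimal sketch: the class name SignatureSniffer and its methods are made up for illustration, and it identifies the container format only. A ZIP signature cannot tell a docx from an xlsx (or from any other zip), an OLE2 signature cannot tell a doc from an xls, and txt has no signature at all.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of the JDK: classifies a file by its leading bytes.
public class SignatureSniffer {
    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};
    private static final byte[] RTF = {'{', '\\', 'r', 't', 'f'};
    // docx and xlsx are ZIP archives, so they start with the ZIP local-file header.
    private static final byte[] ZIP = {'P', 'K', 3, 4};
    // Legacy doc and xls files are OLE2 compound documents.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns "pdf", "rtf", "zip", "ole2", or "unknown" for the given leading bytes.
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) return "pdf";
        if (startsWith(head, RTF)) return "rtf";
        if (startsWith(head, ZIP)) return "zip";
        if (startsWith(head, OLE2)) return "ole2";
        return "unknown";
    }

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    // Reads up to 8 bytes from the stream; returns fewer if the stream is shorter.
    static byte[] readHead(InputStream in) throws IOException {
        byte[] buf = new byte[8];
        int n = 0;
        while (n < buf.length) {
            int r = in.read(buf, n, buf.length - n);
            if (r < 0) break;
            n += r;
        }
        return Arrays.copyOf(buf, n);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            // No file given: classify an in-memory PDF header instead.
            System.out.println(sniff("%PDF-1.4".getBytes(StandardCharsets.US_ASCII)));
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(sniff(readHead(in)));
        }
    }
}
```

A caller could then compare the result with what the extension implies (for example, .xls should yield "ole2" and .xlsx should yield "zip") and reject the file on a mismatch. Telling docx from xlsx for certain would mean looking inside the archive, which this sketch does not attempt.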
I recommend combining the two approaches: guess from the content first, and fall back to the extension.
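One way such a combination might look is sketched below. The class name ContentTypeGuesser and the method guess are hypothetical; only the two URLConnection calls come from the JDK. The stream passed in is assumed to support mark/reset (a ByteArrayInputStream or a BufferedInputStream does).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: content-based guess first, name-based guess as a fallback.
public class ContentTypeGuesser {

    // 'in' must support mark/reset, otherwise the content-based guess is always null.
    static String guess(InputStream in, String fileName) throws IOException {
        String fromContent = URLConnection.guessContentTypeFromStream(in);
        if (fromContent != null) {
            return fromContent;
        }
        // Nothing recognised in the leading bytes: fall back to the extension.
        return URLConnection.guessContentTypeFromName(fileName);
    }

    public static void main(String[] args) throws IOException {
        // A GIF header stored under a misleading .txt name.
        byte[] gifHeader = "GIF89a".getBytes(StandardCharsets.US_ASCII);
        System.out.println(guess(new ByteArrayInputStream(gifHeader), "renamed.txt"));
    }
}
```

Because the content-based guess takes priority, a renamed file is reported by what its bytes look like rather than by its extension. The result can still be null when neither method recognises the file.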