I need to mine the content of most known document formats, such as:
- html
- doc/docx etc.
For most of these file formats I am planning to use Apache Tika.
But as of now Tika does not support MHTML (*.mht) files ( http://en.wikipedia.org/wiki/MHTML ).
There are a few examples in C# ( http://www.codeproject.com/KB/files/MhtBuilder.aspx ), but I found none in Java.
I tried opening the *.mht file in 7-Zip and it failed, although WinZip was able to decompress the file into images and text (CSS, HTML, script) as text and binary files.
As per the MSDN page ( http://msdn.microsoft.com/en-us/library/aa767785%28VS.85%29.aspx#compress_content ) and the CodeProject page I mentioned earlier, mht files use GZip compression.
Attempting to decompress in Java results in the following exceptions:
With java.util.zip.GZIPInputStream:
java.io.IOException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at GZipTest.main(GZipTest.java:16)
And with java.util.zip.ZipFile
java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(Unknown Source)
at java.util.zip.ZipFile.<init>(Unknown Source)
at GZipTest.main(GZipTest.java:21)
Kindly suggest how to decompress it.
Thanks.
Frankly, I wasn't expecting a solution in the near future and was about to give up, but somehow I stumbled on these pages:
http://en.wikipedia.org/wiki/MIME#Multipart_messages
http://msdn.microsoft.com/en-us/library/ms527355%28EXCHG.10%29.aspx
They are not very catchy at first look, but if you look carefully you will get the clue. After reading them I fired up IE and started saving random pages as *.mht files.
Let me go line by line. But let me explain beforehand that my ultimate goal was to separate/extract the html content and parse it. The solution is not complete in itself, as it depends on the character set or encoding I choose while saving. Even so, it will extract the individual files with minor hitches. I hope this will be useful for anyone who is trying to parse/decompress *.mht/MHTML files :)
======= Explanation ========
** Taken from a mht file **
It is the software used for saving the file
Subject, date and mime-version … much like the mail format
This is the part which tells us that it is a multipart document. A multipart document has one or more different sets of data combined in a single body, and a multipart Content-Type field must appear in the entity's header. Here, we can also see the type as "text/html".
Out of all this, the boundary is the most important part. This is the unique delimiter which divides two different parts (html, images, css, script etc.). Once you get hold of this, everything gets easy. Now I just have to iterate through the document, find the different sections, and save them as per their Content-Transfer-Encoding (base64, quoted-printable etc.).
SAMPLE
** JAVA CODE **
An interface for defining constants.
The main parser class...
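As a rough illustration of the steps described in the explanation (locate the boundary, cut the document at each delimiter line, then decode every part according to its Content-Transfer-Encoding), here is a minimal sketch that uses only the standard library. It is an independent example and not the parser class mentioned above; `MhtSplitSketch`, `Part` and `split` are made-up names. It assumes the whole file was read into a String as ISO-8859-1, it normalizes line endings (which would damage raw `binary` parts), and it ignores nested multiparts.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative MHTML splitter; class and method names are hypothetical. */
final class MhtSplitSketch {

    /** One extracted part: header names are lower-cased, content is already decoded. */
    static final class Part {
        final Map<String, String> headers;
        final byte[] content;

        Part(Map<String, String> headers, byte[] content) {
            this.headers = headers;
            this.content = content;
        }
    }

    private static final Pattern BOUNDARY =
            Pattern.compile("boundary=\"?([^\";\\r\\n]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Splits the text of an .mht file (read as ISO-8859-1) into decoded parts. */
    static List<Part> split(String mht) {
        String text = mht.replace("\r\n", "\n");    // simplification: LF-only line endings
        Matcher m = BOUNDARY.matcher(text);
        if (!m.find()) {
            throw new IllegalArgumentException("no boundary parameter found");
        }
        String delimiter = "--" + m.group(1);       // delimiter lines are "--" + boundary

        List<Part> parts = new ArrayList<>();
        String[] chunks = text.split(Pattern.quote(delimiter));
        for (int i = 1; i < chunks.length; i++) {   // chunks[0] is the top-level header
            String chunk = chunks[i];
            if (chunk.startsWith("--")) {
                break;                              // closing delimiter: --boundary--
            }
            int blank = chunk.indexOf("\n\n");      // blank line separates headers from body
            if (blank < 0) {
                continue;
            }
            Map<String, String> headers = parseHeaders(chunk.substring(0, blank));
            String body = chunk.substring(blank + 2);
            if (body.endsWith("\n")) {
                body = body.substring(0, body.length() - 1); // line break belongs to the delimiter
            }
            parts.add(new Part(headers, decode(body, headers.get("content-transfer-encoding"))));
        }
        return parts;
    }

    private static Map<String, String> parseHeaders(String block) {
        Map<String, String> headers = new LinkedHashMap<>();
        // Unfold continuation lines (those starting with a space or tab) first.
        for (String line : block.replaceAll("\n[ \t]+", " ").split("\n")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                headers.put(line.substring(0, colon).trim().toLowerCase(Locale.ROOT),
                        line.substring(colon + 1).trim());
            }
        }
        return headers;
    }

    private static byte[] decode(String body, String encoding) {
        String enc = encoding == null ? "7bit" : encoding.toLowerCase(Locale.ROOT);
        if (enc.equals("base64")) {
            return Base64.getMimeDecoder().decode(body);   // tolerates embedded line breaks
        }
        if (enc.equals("quoted-printable")) {
            return decodeQuotedPrintable(body);
        }
        return body.getBytes(StandardCharsets.ISO_8859_1); // 7bit/8bit/binary: no decoding
    }

    private static byte[] decodeQuotedPrintable(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '=' && i + 1 < s.length() && s.charAt(i + 1) == '\n') {
                i++;                                 // soft line break: drop "=\n"
            } else if (c == '=' && i + 2 < s.length()
                    && Character.digit(s.charAt(i + 1), 16) >= 0
                    && Character.digit(s.charAt(i + 2), 16) >= 0) {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;                              // "=XX" encodes one byte in hex
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }
}
```

Reading the file with `new String(Files.readAllBytes(path), StandardCharsets.ISO_8859_1)` keeps every byte intact before calling `MhtSplitSketch.split(...)`. Each returned part can then be written to disk or, for the html part, converted to text with the charset named in its Content-Type header.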
Regards,
You can try http://www.chilkatsoft.com/mht-features.asp ; it can pack/unpack mht files, and you can then handle the results as normal files. The download link is: http://www.chilkatsoft.com/java.asp
I used http://jtidy.sourceforge.net to parse/read/index mht files (but as normal files, not compressed files).
Late to the party, but expanding on @wener's answer for anyone else stumbling across this.
The Apache Mime4J library seems to have the most readily accessible solution for EML or MHTML processing, much easier than rolling-your-own!
My prototype 'parseMhtToFile' function below rips html files and other artifacts out of a Cognos active report 'mht' file, but could be tailored to other purposes.
This is written in Groovy and requires Apache Mime4J 'core' and 'dom' jars (currently 0.7.2).
Usage is simply:
Output is:
Thoughts on other improvements:
- For 'text' mime parts, you can access a Reader instead of a Stream, which might be more appropriate for text mining as the OP requested.
- For generated filename extensions, I'd use another library to look up the appropriate extension, not assume the mime sub-type is adequate.
- Handle single-body (non-multipart) and recursive multipart mhtml files and other complexities. These may require a MimeStreamParser with a custom ContentHandler implementation.
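The first two points can be illustrated with a small standard-library sketch. It does not use the Mime4J API: `PartHelpersSketch`, `extensionFor` and `readerFor` are hypothetical names, the extension table is a tiny hand-written stand-in for a real MIME-type library, and falling back to UTF-8 when no charset is declared is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative helpers for extracted MIME parts; all names are hypothetical. */
final class PartHelpersSketch {

    // Deliberately tiny, hand-written table standing in for a real MIME-type library.
    private static final Map<String, String> EXTENSIONS = new HashMap<>();
    static {
        EXTENSIONS.put("text/html", "html");
        EXTENSIONS.put("text/css", "css");
        EXTENSIONS.put("text/plain", "txt");
        EXTENSIONS.put("image/jpeg", "jpg");
        EXTENSIONS.put("image/svg+xml", "svg");            // sub-type alone would give "svg+xml"
        EXTENSIONS.put("application/x-javascript", "js");  // sub-type alone would give "x-javascript"
    }

    private static final Pattern CHARSET =
            Pattern.compile("charset=\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Maps a Content-Type value to a file extension, ignoring any parameters. */
    static String extensionFor(String contentType) {
        String type = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return EXTENSIONS.getOrDefault(type, "bin");   // "bin" fallback is an arbitrary choice
    }

    /** Wraps decoded part bytes in a Reader using the charset named in Content-Type. */
    static Reader readerFor(byte[] content, String contentType) {
        Charset charset = StandardCharsets.UTF_8;      // assumed default when none is declared
        Matcher m = CHARSET.matcher(contentType);
        if (m.find()) {
            try {
                charset = Charset.forName(m.group(1));
            } catch (IllegalArgumentException unknownOrMalformed) {
                // unknown or malformed charset name: keep the assumed default
            }
        }
        return new InputStreamReader(new ByteArrayInputStream(content), charset);
    }
}
```

For example, `extensionFor("image/svg+xml")` gives `svg`, where the bare sub-type would not be a usable extension, and `readerFor(bytes, "text/html; charset=\"iso-8859-1\"")` yields characters ready for text mining.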
You don't have to do it on your own.
With dependency
Roll your mht file
MessageTree
will
Then you can look into it.
;-)