I want to use scala to parse a .mht file, but I found my code is exactly like Java.
Following is a mht
file sample:
From: <Save by Tencent MsgMgr>
Subject: Tencent IM Message
MIME-Version: 1.0
Content-Type:multipart/related;
charset="utf-8"
type="text/html";
boundary="----=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19"
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type: text/html
Content-Transfer-Encoding:7bit
<html xmlns="http://www.w3.org/1999/xhtml"><head></head>...</html>
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type:image/jpeg
Content-Transfer-Encoding:base64
Content-Location:{64172C34-99E7-40f6-A933-3DDCF670ACBA}.dat
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMU
FRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQU
FBQUFBQUFBT/wAARCAJwA7sDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUF
BAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type:image/jpeg
Content-Transfer-Encoding:base64
Content-Location:{64172C34-99E7-40f6-A933-3DDCF670ACBA}.dat
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMU
FRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQU
FBQUFBQUFBT/wAARCAJwA7sDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUF
BAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type:image/jpeg
Content-Transfer-Encoding:base64
Content-Location:{64172C34-99E7-40f6-A933-3DDCF670ACBA}.dat
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMU
FRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQU
FBQUFBQUFBT/wAARCAJwA7sDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUF
BAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
There is a special line called boundary
, which is a separator line:
------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
The first part is some information about this file, which can be ignored. Following are 4 blocks, the first one is a html
file, others are jpg
images with base64
encoded text.
If I use Java, the code is like:
BufferedReader reader = new BufferedReader(new FileInputStream(new File("test.mht")))
String line = null;
String boundary = null;
// for a block
String contentType = null;
String encoding = null;
String location = null;
List<String> data = null;
while((line=reader.readLine())!=null) {
// first, get the boundary
if(boundary==null) {
if(line.trim().startsWith("boundary=\"") {
boundary = substringBetween(line, "\"", "\"");
}
continue;
}
if(line.equals("--"+boundary) { // new block
if(contentType!=null) {
// save data to a file
}
encoding=null;
contentType=null;
location = null;
data = new ArrayList<String>();
} else {
if(id==null || contentType==null || location ==null) {
if(line.trim().startsWith("Content-Type:") { /* get content type */ }
// else check encoding
// else check location
} else {
data.add(line);
}
}
}
I tried to use scala to rewrite the code, but I found the structure of my code is nearly the same, except I used the scala syntax instead of Java.
Is there a scala way to do the same work?
PS: I don't want to load the full file into memory, since the file is huge. Instead I want to read and parse it line by line.
Thanks for helping!
This could be a very simple use case of state machine.
I'm going to explain how to build a general solution in a standard way using parser combinators. The other solution presented is much faster, but, once you understand how to do this, you can easily adapt it to other tasks.
First, what you are showing is an e-mail message. The format to such messages is defined in a bunch of RFCs. RFC-822 define basics of header and body, though it enters in considerable detail about the headers, but says nothing about the body. RFC-1521 and 1522 talks about MIME, and are, themselves, revisions of RFCs 1341 and 1342. There are many other RFCs about the subject.
The interesting thing is that they provide grammars about this stuff, so you can write parsers to decompose it correctly. Let's start with a simplified version of RFC822, pretty much ignoring all the known fields and their formats, and simply place everything in a map. I do this because the grammar is rather long, and the few lines I have here can already be compared to the ones in the RFC.
On Scala Parser combinators, every rule is separated by
~
(in the RFC, just spaces separated them), and I use<~
or~>
sometimes to discard an uninteresting part of it. Also, I used^^
to transform what was parsed into a data structure to be used.Now let's deal with the content type. The specification on RFC-1521 is this is implemented below. I have the word
type
between backticks because it's a reserved word in Scala. Also, I'm making a semi-colon optional, because the sample you gave is missing one after definingchar-set
.We can now use this to parse the text.
You can then use it like
Parser(email)
.Again, I'm not proposing you use this solution for your current problem! But learning this might help you in the future.