How to parse a line-based text file (.mht) in Scala

Posted 2019-02-07 09:24

I want to use Scala to parse an .mht file, but my code ends up looking exactly like Java.

Here is a sample .mht file:

From: <Save by Tencent MsgMgr>
Subject: Tencent IM Message
MIME-Version: 1.0
Content-Type:multipart/related;
    charset="utf-8"
    type="text/html";
    boundary="----=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19"

------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type: text/html
Content-Transfer-Encoding:7bit

<html xmlns="http://www.w3.org/1999/xhtml"><head></head>...</html>

------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type:image/jpeg
Content-Transfer-Encoding:base64
Content-Location:{64172C34-99E7-40f6-A933-3DDCF670ACBA}.dat

/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMU
FRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQU
FBQUFBQUFBT/wAARCAJwA7sDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUF
BAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi

------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type:image/jpeg
Content-Transfer-Encoding:base64
Content-Location:{64172C34-99E7-40f6-A933-3DDCF670ACBA}.dat

/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMU
FRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQU
FBQUFBQUFBT/wAARCAJwA7sDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUF
BAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi

------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19
Content-Type:image/jpeg
Content-Transfer-Encoding:base64
Content-Location:{64172C34-99E7-40f6-A933-3DDCF670ACBA}.dat

/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMU
FRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQU
FBQUFBQUFBT/wAARCAJwA7sDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUF
BAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVW
V1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi

------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19

There is a special line called boundary, which is a separator line:

------=_NextPart_20CAFF23_6090_43fc_8C0A.EE179EE81D19

The first part contains some information about the file and can be ignored. It is followed by 4 blocks: the first is an HTML document, and the others are JPEG images stored as base64-encoded text.

If I use Java, the code is like:

BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(new File("test.mht")), StandardCharsets.UTF_8));
String line = null;

String boundary = null;

// for a block
String contentType = null;
String encoding = null;
String location = null;
List<String> data = null;

while((line=reader.readLine())!=null) {
    // first, get the boundary
    if(boundary==null) {
        if(line.trim().startsWith("boundary=\"")) {
             boundary = substringBetween(line, "\"", "\"");
        }
        continue;
    }

    if(line.equals("--"+boundary)) { // new block
        if(contentType!=null) {
           // save data to a file
        }
        encoding=null;
        contentType=null;
        location = null;
        data = new ArrayList<String>();
    } else {
        if(encoding==null || contentType==null || location ==null) {
            if(line.trim().startsWith("Content-Type:")) { /* get content type */ }
            // else check encoding
            // else check location
        } else {
            data.add(line);
        }
    }
}

I tried to rewrite the code in Scala, but the structure came out nearly the same, only with Scala syntax instead of Java.

Is there a more idiomatic Scala way to do the same work?

PS: I don't want to load the full file into memory, since the file is huge. Instead I want to read and parse it line by line.

Thanks for helping!

2 answers
迷人小祖宗
Reply #2 · 2019-02-07 09:46

This is a simple use case for a state machine.

import scala.collection.mutable.ListBuffer
import scala.io.Source

case class Part(contentType: Option[String], encoding: Option[String], location: Option[String], data: ListBuffer[String])

var boundary: String = null

val Boundary = """.*boundary="(.*)"""".r
var state = 0
val IN_PART = 1
val IN_DATA = 2

var _contentType: Option[String] = None
var _encoding: Option[String] = None
var _location: Option[String] = None
var _data = new ListBuffer[String]()

Source.fromFile("test.mht").getLines().foreach {
  case Boundary(b) => boundary = b
  // a delimiter line is the boundary value prefixed with two dashes
  case line if boundary != null && line == "--" + boundary =>
    _contentType = None
    _encoding = None
    _location = None
    _data = new ListBuffer[String]()
    state = IN_PART
  case "" => state match {
    case IN_PART => state = IN_DATA // a blank line ends the part headers
    case IN_DATA =>
        val currentPart = Part(_contentType, _encoding, _location, _data)
        /* deal with currentPart here */
    case _ =>
  }
  case line => state match {
    case IN_DATA => _data.append(line)
    // split on the first colon only, so header values may contain colons
    case IN_PART => line.split(":", 2) match {
      case Array("Content-Type", t) => _contentType = Some(t.trim)
      case Array("Content-Transfer-Encoding", e) => _encoding = Some(e.trim)
      case Array("Content-Location", l) => _location = Some(l.trim)
      case _ =>
    }
    case _ => // lines before the first delimiter are ignored
  }
}
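The "deal with current Part" step is left open above. As one possible way to fill it in, the collected lines can be turned into raw bytes before being written to disk. The following is only a sketch: the object name PartDecoder and the method decodePart are hypothetical, and it assumes that any encoding other than base64 means the lines can be kept as plain text.

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

// Hypothetical helper, not part of the answer's code.
object PartDecoder {
  /** Turns the collected lines of one part into raw bytes. */
  def decodePart(encoding: Option[String], data: Seq[String]): Array[Byte] =
    encoding.map(_.trim.toLowerCase) match {
      case Some("base64") =>
        // The MIME decoder ignores line separators and any other
        // characters outside the base64 alphabet.
        Base64.getMimeDecoder.decode(data.mkString("\n"))
      case _ =>
        // Assumption: 7bit, 8bit or a missing encoding is plain text.
        data.mkString("\n").getBytes(StandardCharsets.UTF_8)
    }
}
```

The resulting bytes could then be saved with something like java.nio.file.Files.write, using the Content-Location value as the file name. Only one part is held in memory at a time, which fits the line-by-line requirement in the question.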
劫难
Reply #3 · 2019-02-07 09:50

I'm going to explain how to build a general solution in a standard way using parser combinators. The other solution presented is much faster, but, once you understand how to do this, you can easily adapt it to other tasks.

First, what you are showing is an e-mail message. The format of such messages is defined in a number of RFCs. RFC-822 defines the basics of header and body: it goes into considerable detail about the headers but says nothing about the structure of the body. RFC-1521 and RFC-1522 cover MIME and are themselves revisions of RFC-1341 and RFC-1342. There are many other RFCs on the subject.

The interesting thing is that they provide grammars about this stuff, so you can write parsers to decompose it correctly. Let's start with a simplified version of RFC822, pretty much ignoring all the known fields and their formats, and simply place everything in a map. I do this because the grammar is rather long, and the few lines I have here can already be compared to the ones in the RFC.

With Scala parser combinators, the elements of a rule are joined with ~ (in the RFC they are simply separated by spaces), and I sometimes use <~ or ~> to discard an uninteresting part of a rule. I also use ^^ to transform what was parsed into a data structure that can be used later.
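As a small warm-up that is independent of the RFC-822 grammar, here is a sketch of these operators on a single "Name: value" line. The names HeaderLine, name, value and field are made up for this illustration. It assumes the scala.util.parsing.combinator package is available; on recent Scala versions it ships in the separate scala-parser-combinators module.

```scala
import scala.util.parsing.combinator._

// Illustrative only: parses one "Name: value" line into a pair.
object HeaderLine extends RegexParsers {
  // Whitespace is significant here, so handle it explicitly.
  override def skipWhitespace = false

  def name: Parser[String]  = """[^:\s]+""".r
  def value: Parser[String] = """.*""".r

  // <~ keeps the left result and drops the colon and the spaces after it,
  // ~ sequences name and value, and ^^ turns the result into a tuple.
  def field: Parser[(String, String)] =
    (name <~ ":" <~ """[ \t]*""".r) ~ value ^^ { case n ~ v => n -> v }
}
```

For example, HeaderLine.parseAll(HeaderLine.field, "Content-Type: text/html") succeeds with the pair ("Content-Type", "text/html"), while an input without a colon yields a failure result rather than an exception.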

import scala.util.parsing.combinator._

/** Object companion to RFC822, containing the Message class,
 *  and extending the trait so that it can be used as a parser
 */
object RFC822 extends RFC822 {
  case class Message(header: Map[String, String], text: String)
}

/**
 *  Parses `message` according to RFC-822 (http://www.w3.org/Protocols/rfc822/),
 *  but without breaking up the contents for each field, 
 *  nor identifying particular fields.
 *
 *  Also, introduces "header" to convert all fields into a map.
 */
class RFC822 extends RegexParsers {
  import RFC822.Message

  override def skipWhitespace = false

  def message = (header <~ CRLF) ~ text ^^ {
    case hd ~ txt => Message(hd, txt)
  }

  // this isn't part of the RFC, but we use it to generate a map
  def header = field.* ^^ { _.toMap }

  def field = (fieldName <~ ":") ~ fieldBody <~ CRLF ^^ { case name ~ body => name -> body }
  def fieldName = """[^:\P{Graph}]+""".r

  // Recursive definition needs a type
  // Also, I use .+ on LWSPChar because it's specified for the lexer,
  // which we are not using
  def fieldBody: Parser[String] = fieldBodyContents ~ (CRLF ~> LWSPChar.+ ~> fieldBody).? ^^ { 
    case a ~ Some(b) => a + " " + b // reintroduces a single LWSPChar
    case a ~ None    => a
  }
  def fieldBodyContents = ".*".r

  def CRLF = """\n""".r  // this needs to be the regex \n pattern
  def LWSPChar = " " | "\t"  // these do not need to be regex

  def text = "(?s).*".r // (?s) makes . match newlines
}

Now let's deal with the content type. Its grammar is specified in RFC-1521 and is implemented below. The word type is between backticks because it is a reserved word in Scala. I'm also making the semicolon optional, because the sample you gave is missing one after the charset definition.

object ContentType extends ContentType {
  case class Content(`type`: String, subtype: String, parameter: Map[String, String])
}

class ContentType extends RegexParsers {
  import ContentType.Content

  // case-insensitive matching of type and subtype
  def content =   ("Content-Type" ~> ":" ~> `type` <~ "/") ~ subtype ~ parameters ^^ {
    case t ~ s ~ p => Content(t, s, p)
  }

  // use this to generate a map
  // *** SEMI-COLON IS NOT OPTIONAL ***
  // I'm making it optional because the example is missing one
  def parameters = (";".? ~> parameter).* ^^ (_.toMap)

  // All values case-insensitive
  def `type` = ( "(?i)application".r | "(?i)audio".r
               | "(?i)image".r       | "(?i)message".r
               | "(?i)multipart".r   | "(?i)text".r
               | "(?i)video".r       | extensionToken
               )

  def extensionToken =  xToken | ianaToken
  def ianaToken = failure("IANA token not implemented")
  def xToken = """(?i)x-(?!\s)""".r ~ token ^^ { case a ~ b => a + b }

  def subtype = token

  def parameter = (attribute <~ "=") ~ value ^^ { case a ~ b => a -> b }
  def attribute = token // case-insensitive
  def value = token | quotedString

  def token: Parser[String] =  not(tspecials) ~> """\p{Graph}""".r ~ token.? ^^ {
    case a ~ Some(b) => a + b
    case a ~ None    => a
  }

  // Must be in quoted-string,
  // to use within parameter values
  def tspecials =  ( "(" | ")" | "<" | ">" | "@"
                   | "," | ";" | ":" | "\\" | "\""
                   | "/" | "[" | "]" | "?" | "="
                   )

  // These are part of RFC822
  def qtext = """[^\\"\n]""".r
  def quotedPair =  """\\.""".r
  def quotedString = "\"" ~> (qtext|quotedPair).* <~ "\"" ^^ { _.mkString }
}

We can now use this to parse the text.

object Parser {
  def apply(email: String): Option[(Map[String, String], List[String])] = {
    import RFC822._

    parseAll (message, email) match {
      case Success(result, _) =>
        if (result.header get "Content-Type" nonEmpty) Some(getParts(result))
        else Some(result.header -> List(result.text))
      case _ => None
    }
  }

  def getParts(message: RFC822.Message): (Map[String, String], List[String]) = {
    import ContentType._

    parseAll (content, "Content-Type: " + message.header("Content-Type")) match {
      case Success(Content("multipart", _, parameters), _) =>
        // The ^.*? part eats the characters before the boundary on the same
        // line: in the sample, every delimiter line starts with two extra
        // dashes. (?m) lets ^ match at the start of every line, not only
        // at the start of the text.
        val parts = message.text split ("(?m)^.*?\\Q" + parameters("boundary") + "\\E")
        val bodies = parts flatMap this.apply flatMap (_._2)
        message.header -> bodies.toList
      case _ => message.header -> List(message.text)
    }
  }
}

You can then use it like Parser(email).

Again, I'm not proposing you use this solution for your current problem! But learning this might help you in the future.
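One reason is that Parser takes the whole message as a single String, while the question asks for line-by-line processing. For that requirement, a lazy split on the delimiter line could look roughly like the sketch below. The names Blocks and split are hypothetical, and the sketch assumes the delimiter ("--" followed by the boundary value) has already been read from the header.

```scala
// Hypothetical helper: groups an iterator of lines into blocks,
// holding only one block in memory at a time.
object Blocks {
  def split(lines: Iterator[String], delimiter: String): Iterator[Vector[String]] =
    new Iterator[Vector[String]] {
      private val in = lines.buffered

      def hasNext: Boolean = in.hasNext

      def next(): Vector[String] = {
        val block = Vector.newBuilder[String]
        // Collect lines until the next delimiter or the end of input.
        while (in.hasNext && in.head != delimiter) block += in.next()
        if (in.hasNext) in.next() // drop the delimiter line itself
        block.result()
      }
    }
}
```

It could be fed with scala.io.Source.fromFile("test.mht").getLines(), so the file is never loaded as a whole. The first block is then the file header, and each later block holds the header lines and data of one part.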
