Scala Parser Combinators: Parsing in a stream

Posted 2019-05-26 17:29

Question:

I'm using the native parser combinator library in Scala, and I'd like to use it to parse a number of large files. I have my combinators set up, but the file that I'm trying to parse is too large to be read into memory all at once. I'd like to stream from an input file through my parser and write the results back to disk, so that I never need to hold the whole file in memory.

My current system looks something like this:

import scala.io.Source

val f = Source.fromFile("myfile")
parser.parse(parser.document.+, f.reader()).get.foreach(_.writeToFile)
f.close()

This reads the whole file in as it parses, which I'd like to avoid.

Answer 1:

There is no easy or built-in way to accomplish this using Scala's parser combinators, which provide a facility for implementing parsing expression grammars.

Operators such as ||| (longest match) are largely incompatible with a stream parsing model, as they require extensive backtracking capabilities. In order to accomplish what you are trying to do, you would need to re-formulate your grammar such that no backtracking is required, ever. This is generally much harder than it sounds.

As mentioned by others, your best bet would be to look into a preliminary phase where you chunk your input (e.g. by line) so that you can handle a portion of the stream at a time.
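
A minimal sketch of that chunk-then-parse approach, assuming the scala-parser-combinators module is on the classpath and that each line is an independent record. The "key=value" grammar and the names LineParser and StreamingParse are hypothetical stand-ins for your own combinators, not part of the library:

```scala
import java.io.PrintWriter
import scala.io.Source
import scala.util.parsing.combinator.RegexParsers

// Hypothetical line-level grammar: one "key=value" record per line.
object LineParser extends RegexParsers {
  def key: Parser[String] = """[A-Za-z_]\w*""".r
  def value: Parser[String] = """[^\r\n]*""".r
  def record: Parser[(String, String)] =
    key ~ ("=" ~> value) ^^ { case k ~ v => (k, v) }
}

object StreamingParse {
  // Parses `in` one line at a time and writes each result to `out`
  // immediately, so only the current line is held in memory.
  def run(in: String, out: String): Unit = {
    val source = Source.fromFile(in, "UTF-8")
    val writer = new PrintWriter(out, "UTF-8")
    try {
      for (line <- source.getLines()) {
        LineParser.parseAll(LineParser.record, line) match {
          case LineParser.Success((k, v), _) =>
            writer.println(s"$k -> $v")
          case failure: LineParser.NoSuccess =>
            sys.error(s"cannot parse '$line': ${failure.msg}")
        }
      }
    } finally {
      writer.close()
      source.close()
    }
  }
}
```

Because the parser only ever sees one line, any backtracking is confined to that line. This works only when no grammar rule spans a chunk boundary.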



Answer 2:

One easy way of doing it is to grab an Iterator from the Source object and then walk through the lines like so:

import scala.io.Source

val source = Source.fromFile("myFile")
val lines = source.getLines()
for (line <- lines) {
    // Parse or otherwise process the current line here
}
source.close() // Close the file

Of course, this only works if your parser can consume the input one line at a time.

Source: https://groups.google.com/forum/#!topic/scala-user/LPzpXo3sUVE



Answer 3:

You might try the StreamReader class that is part of the parsing package.

You would use it something like this:

import scala.io.Source.fromFile
import scala.util.parsing.input.StreamReader

val f = StreamReader(fromFile("myfile", "UTF-8").reader())

parseAll(parser, f)


Answer 4:

As one poster above mentioned, longest match is a problem. On top of that, the regex parsers call source.subSequence(0, source.length), which needs the full length of the input, so even StreamReader doesn't help.

The best kludgy answer I have is to use getLines, as others have mentioned, and to chunk the input, as the accepted answer suggests. My particular input required chunks of 2 lines at a time. You could build an iterator out of the chunks to make it slightly less ugly.
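
One way such a chunk iterator might look, using only the standard library. ChunkedLines, chunks and foreachChunk are hypothetical names, and the `handle` callback stands in for whatever parser call you run on each chunk:

```scala
import scala.io.Source

object ChunkedLines {
  // Lazily groups lines into chunks of `size` lines joined by newlines;
  // the final chunk may contain fewer lines.
  def chunks(lines: Iterator[String], size: Int): Iterator[String] =
    lines.grouped(size).map(_.mkString("\n"))

  // Applies `handle` to each chunk of the file in turn, so at most one
  // chunk is materialized at a time.
  def foreachChunk(path: String, size: Int)(handle: String => Unit): Unit = {
    val source = Source.fromFile(path, "UTF-8")
    try chunks(source.getLines(), size).foreach(handle)
    finally source.close()
  }
}
```

With 2-line records, a call such as ChunkedLines.foreachChunk("myfile", 2)(chunk => ...) would hand each pair of lines to the parser as a single string.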