I know about the parallel collections in Scala. They are handy! However, I would like to iterate in parallel over the lines of a file that is too large to fit in memory. I could create threads and set up a lock over a Scanner, for example, but it would be great if I could run code such as:
Source.fromFile(path).getLines.par foreach { line =>
Unfortunately, this fails with:
error: value par is not a member of Iterator[String]
What is the easiest way to accomplish some parallelism here? For now, I will read in some lines and handle them in parallel.
I'll put this as a separate answer since it's fundamentally different from my last one (and it actually works).
Here's an outline for a solution using actors, which is basically what Kim Stebel's comment describes. There are two actor classes: a single FileReader actor that reads individual lines from the file on demand, and several Worker actors. The workers all send requests for lines to the reader and process the lines in parallel as they are read from the file.
I'm using Akka actors here, but the idea is basically the same with any other actor implementation.
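The original code for this answer is not shown here, so the following is only a minimal sketch of the reader/worker arrangement described above, assuming classic (untyped) Akka actors are on the classpath. The message names (`RequestLine`, `Line`, `EndOfFile`), the `ActorLines.processFile` helper, and the `process` callback are hypothetical names chosen for illustration.

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}
import scala.concurrent.Await
import scala.concurrent.duration.Duration
import scala.io.Source

// Messages exchanged between the workers and the reader (hypothetical names).
case object RequestLine
final case class Line(text: String)
case object EndOfFile

// Owns the file and hands out exactly one line per request.
class FileReader(path: String, workerCount: Int) extends Actor {
  private val source = Source.fromFile(path)
  private val lines = source.getLines()
  private var finishedWorkers = 0

  def receive: Receive = {
    case RequestLine if lines.hasNext =>
      sender() ! Line(lines.next())
    case RequestLine =>
      // A worker only asks again after finishing its previous line, so once
      // every worker has been told EndOfFile, all processing is complete.
      sender() ! EndOfFile
      finishedWorkers += 1
      if (finishedWorkers == workerCount) context.system.terminate()
  }

  override def postStop(): Unit = source.close()
}

// Requests a line, processes it, then requests the next one.
class Worker(reader: ActorRef, process: String => Unit) extends Actor {
  override def preStart(): Unit = reader ! RequestLine

  def receive: Receive = {
    case Line(text) =>
      process(text)
      reader ! RequestLine
    case EndOfFile =>
      context.stop(self)
  }
}

object ActorLines {
  // Blocks until every line has been processed and the actor system has shut down.
  def processFile(path: String, workerCount: Int = 4)(process: String => Unit): Unit = {
    val system = ActorSystem("line-processing")
    val reader = system.actorOf(Props(new FileReader(path, workerCount)), "reader")
    (1 to workerCount).foreach { i =>
      system.actorOf(Props(new Worker(reader, process)), s"worker-$i")
    }
    Await.result(system.whenTerminated, Duration.Inf)
  }
}
```

Because `process` runs on several workers at once, it has to be thread-safe. The sketch does no error handling, so a missing file would leave the caller waiting.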
This way, no more than 4 (or however many workers you have) unprocessed lines are in memory at a time.
The comments on Dan Simon's answer got me thinking. Why don't we try wrapping the Source in a Stream:
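The wrapper itself is not shown here, so this is a rough sketch of what it might look like, assuming a recursive definition over the line iterator. `LineStreams`, `lineStream`, and `fromFile` are hypothetical names. `Stream` is deprecated in favour of `LazyList` from Scala 2.13 on, but it is used here to match the text.

```scala
import scala.io.Source

object LineStreams {
  // Builds a lazy Stream on top of a line iterator: the head is read eagerly,
  // and each tail cell is read only when it is first accessed.
  def lineStream(lines: Iterator[String]): Stream[String] =
    if (lines.hasNext) lines.next() #:: lineStream(lines)
    else Stream.empty

  // Note: this sketch never closes the underlying Source.
  def fromFile(path: String): Stream[String] =
    lineStream(Source.fromFile(path).getLines())
}
```

Keep in mind that a `Stream` memoizes the cells it has already evaluated, so holding on to the head of the stream keeps every line read so far reachable.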
Then you could consume it in parallel like this:
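A sketch of the parallel consumption, assuming Scala 2.12 or earlier, where `.par` is part of the standard library (from 2.13 on it needs the separate scala-parallel-collections module and `import scala.collection.parallel.CollectionConverters._`). `getLines().toStream` stands in for the wrapper here, and `ParallelStream.countNonEmptyLines` is a hypothetical helper used only to have something observable to compute.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.io.Source

object ParallelStream {
  // Assumes Scala 2.12 or earlier, where .par is built into the standard library.
  def countNonEmptyLines(path: String): Int = {
    val source = Source.fromFile(path)
    val count = new AtomicInteger(0)
    try {
      // The body runs on several threads, so shared state must be thread-safe.
      source.getLines().toStream.par.foreach { line =>
        if (line.nonEmpty) count.incrementAndGet()
      }
    } finally source.close()
    count.get
  }
}
```

One caveat: as far as I can tell, calling `.par` on a sequential collection such as a `Stream` copies its elements into a strict parallel collection first, so this approach likely reads the whole file into memory before any parallel work starts.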
I tried this out, and it compiles and runs at any rate. I'm honestly not sure whether it loads the whole file into memory, but I don't think it does.
We ended up writing a custom solution at our company so we would understand the parallelism exactly.
You could use grouping to easily slice the iterator into chunks you can load into memory and then process in parallel.
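A minimal sketch of the grouping idea, under a few assumptions: the original presumably called `.par` on each chunk, whereas this version uses standard-library Futures so that it does not depend on the parallel collections module. `ChunkedLines.processInChunks`, `chunkSize`, and `process` are hypothetical names.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.io.Source

object ChunkedLines {
  // `chunkSize` and `process` are hypothetical parameters chosen for illustration.
  def processInChunks(path: String, chunkSize: Int = 1000)(process: String => Unit): Unit = {
    val source = Source.fromFile(path)
    try {
      // grouped pulls at most `chunkSize` lines from the iterator at a time,
      // so only one chunk is held in memory.
      source.getLines().grouped(chunkSize).foreach { chunk =>
        // One Future per line; wait for the whole chunk before reading the next one.
        val work = Future.traverse(chunk)(line => Future(process(line)))
        Await.result(work, Duration.Inf)
      }
    } finally source.close()
  }
}
```

The chunk size trades memory for scheduling overhead: larger chunks keep the thread pool busier, but every chunk must fit in memory, and a slow line holds up the start of the next chunk.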
In my opinion, something like this is the simplest way to do it.
I realize this is an old question, but you may find the ParIterator implementation in the iterata library to be a useful no-assembly-required implementation of this:
Below helped me to achieve