I have a very basic question. Spark's flatMap function allows you to emit 0, 1, or more outputs per input, so the (lambda) function you feed to flatMap should return a list.
My question is: what happens if this list is too large for your memory to handle?
I haven't implemented this yet; the question should be resolved before I rewrite my MapReduce software, which could easily deal with this by calling context.write() anywhere in my algorithm I wanted to. (The output of a single mapper could easily be many gigabytes.)
In case you're interested: the mapper does some sort of word count, but in fact it generates all possible substrings, together with a wide range of regular expressions matching the text (a bioinformatics use case).
No, it doesn't have to return a list. In practice you can easily use a lazy sequence. It is probably easier to spot when you take a look at the Scala `RDD.flatMap` signature. Since subclasses of `TraversableOnce` include `SeqView` or `Stream`, you can use a lazy sequence instead of a `List`. For example:
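A minimal Scala sketch (the signature is as declared in Spark's RDD API; the `rdd` value below is hypothetical):

```scala
// The RDD.flatMap signature: f may return any TraversableOnce,
// not just a List
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]

// A lazy Stream works in place of a strict List, so the expansion
// is never fully materialized in memory:
rdd.flatMap(x => Stream.range(0L, x))
```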
Since you've mentioned a lambda function, I assume you're using PySpark. The simplest thing you can do is to return a generator instead of a list:
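A minimal sketch of the idea, using the substring use case from the question (the generator itself is plain Python; the commented-out `flatMap` call shows how it would be handed to PySpark):

```python
def to_substrings(s):
    # Yield substrings one at a time instead of building a full list,
    # so memory use stays bounded no matter how many are emitted.
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            yield s[i:j]

# In PySpark the generator function is passed to flatMap directly:
# rdd.flatMap(to_substrings)

print(sorted(to_substrings("ab")))  # -> ['a', 'ab', 'b']
```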
Since RDDs are lazily evaluated, it is even possible to return an infinite sequence from `flatMap`. Using a little bit of `toolz` power:
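A sketch of the idea; where the original uses `toolz` (its `iterate` and `take` helpers), this version sticks to the standard library's `itertools.islice` and a plain generator, which play the same roles:

```python
from itertools import islice

def powers_of_ten(x):
    # An infinite lazy stream: x, x*10, x*100, ...
    # (toolz.iterate(lambda y: y * 10, x) builds the same stream)
    while True:
        yield x
        x *= 10

# Because RDDs are evaluated lazily, such a generator can be handed
# to flatMap, as long as downstream operations bound the result:
# rdd.flatMap(powers_of_ten)

print(list(islice(powers_of_ten(3), 4)))  # -> [3, 30, 300, 3000]
```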