Spark FlatMap function for huge lists

Posted 2019-01-15 04:32

I have a very basic question. Spark's flatMap function allows you to emit 0, 1, or more outputs per input. So the (lambda) function you feed to flatMap should return a list.

My question is: what happens if this list is too large for your memory to handle?

I haven't implemented this yet; the question should be resolved before I rewrite my MapReduce software, which could easily deal with this by putting context.write() anywhere in my algorithm I wanted. (The output of a single mapper can easily be many gigabytes.)

In case you're interested: the mapper does a sort of word count, but in fact it generates all possible substrings, together with a wide range of regular expressions matching the text (a bioinformatics use case).
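
To give a rough idea of the blow-up, here is a simplified sketch of the kind of expansion I mean (plain Python, not my actual mapper, which also emits regex matches):

def substrings(seq):
    # Every contiguous substring of seq: a sequence of length n yields
    # n * (n + 1) / 2 items, so the output grows quadratically.
    for i in range(len(seq)):
        for j in range(i + 1, len(seq) + 1):
            yield seq[i:j]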

1 Answer
Anthone
2019-01-15 05:11

So the (lambda) function you feed to flatMap should return a list.

No, it doesn't have to return a list. In practice you can easily use a lazy sequence. It is probably easier to see if you take a look at the Scala RDD.flatMap signature:

flatMap[U](f: (T) ⇒ TraversableOnce[U])

Since subclasses of TraversableOnce include SeqView and Stream, you can use a lazy sequence instead of a List. For example:

val rdd = sc.parallelize("foo" :: "bar" :: Nil)
// .view makes the range lazy, so the billion pairs are never materialized at once
rdd.flatMap { x =>
  (1 to 1000000000).view.map { _ => (x, scala.util.Random.nextLong) }
}

Since you've mentioned a lambda function, I assume you're using PySpark. The simplest thing you can do is to return a generator instead of a list:

import numpy as np

rdd = sc.parallelize(["foo", "bar"])
# The generator expression is evaluated lazily (on Python 3 use range instead of xrange)
rdd.flatMap(lambda x: ((x, np.random.randint(1000)) for _ in xrange(100000000)))
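
Applied to your substring use case, the same idea works with a named generator function. A minimal sketch, assuming an RDD of sequence strings (the expand function and the sample data are illustrative, not your actual logic):

def expand(seq):
    # Lazily yield (substring, 1) pairs; Spark pulls items from the
    # generator one at a time instead of materializing a full list.
    for i in range(len(seq)):
        for j in range(i + 1, len(seq) + 1):
            yield (seq[i:j], 1)

counts = (sc.parallelize(["GATTACA", "CTGA"])
            .flatMap(expand)
            .reduceByKey(lambda a, b: a + b))

During the shuffle that follows, Spark writes map output to disk, so the per-record expansion never has to fit in memory as a single list.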

Since RDDs are evaluated lazily, it is even possible to return an infinite sequence from flatMap. Using a little bit of toolz power:

from toolz.itertoolz import iterate

def inc(x):
    return x + 1

# iterate(inc, 0) yields 0, 1, 2, ... indefinitely; take(1) pulls only the first element
rdd.flatMap(lambda x: ((i, x) for i in iterate(inc, 0))).take(1)