Spark - Nested RDD Operation

2019-05-08 04:17发布

问题:

I have two RDDs say

   rdd1 = 
id            | created     | destroyed | price   
1            | 1            | 2            | 10        
2            | 1            | 5            | 11       
3            | 2            | 3            | 11        
4            | 3            | 4            | 12        
5            | 3            | 5            | 11       

rdd2 =

[1,2,3,4,5] # lets call these value as timestamps (ts)

rdd2 is basically generated using range(intial_value, end_value, interval). The params here can vary. The size can be same or different to rdd1. The idea is to fetch records from rdd1 into rdd2 based on the values of rdd2 using a filtering criertia(records from rdd1 can repeat while fetching as you can see in output)

filtering criteria rdd1.created <= ts < rdd1.destroyed)

Expected output:

ts             | prices  
1              | 10,11       # i.e. for ids 1,2 of rdd1      
2              | 11,11       # ids 2,3
3              | 11,12,11    # ids 2,4,5 
4              | 11,11       # ids 2,5

Now I want to filter RDD1 based on some condition which uses the keys of RDD2. (described above) And returns the results which joins the keys of RDD2 and filtered results of RDD1

So I do:

rdd2.map(lambda x : somefilterfunction(x, rdd1))  

def somefilterfunction(x, rdd1):
    filtered_rdd1 = rdd1.filter(rdd1[1] <= x).filter(rdd1[2] > x)
    prices = filtered_rdd1.map(lambda x : x[3])
    res = prices.collect()
    return (x, list(res))

And I get:

Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

I tried using groupBy , but since here elements of rdd1 can be repeated again and again as compared to grouping which I understand would club each element of rdd1 in some particular slot just once.

The only way is now to use a normal for loop and do the filtering and join everything in the end.

Any Suggestions?

回答1:

Since you use regular range there is no reason to create a second RDD at all. You can simply generate values in a specific range for each record:

from __future__ import division # Required only for Python 2.x
from math import ceil
from itertools import takewhile

rdd1 = sc.parallelize([
    (1, 1, 2, 10),        
    (2, 1, 5, 11),       
    (3, 2, 3, 11),        
    (4, 3, 4, 12),        
    (5, 3, 5, 11),  
])


def generate(start, end, step):
    def _generate(id, created, destroyed, price):
        # Smallest ts >= created
        start_for_record = int(ceil((created - start) / step) * step + start)
        rng = takewhile(
            lambda x: created <= x < destroyed,
            xrange(start_for_record, end, step)) # In Python 3.x use range
        for i in rng:
            yield i, price

    return _generate

result = rdd1.flatMap(lambda x: generate(1, 6, 1)(*x)).groupByKey()

And result:

result.mapValues(list).collect()

## [(1, [10, 11]), (2, [11, 11]), (3, [11, 12, 11]), (4, [11, 11])]