Is it possible to create nested RDDs in Apache Spa

2019-01-15 17:04发布

I am trying to implement K-nearest neighbor algorithm in Spark. I was wondering if it is possible to work with nested RDD's. This will make my life a lot easier. Consider the following code snippet.

public static void main (String[] args){
//blah blah code
JavaRDD<Double> temp1 = testData.map(
    new Function<Vector,Double>(){
        public Double call(final Vector z) throws Exception{
            JavaRDD<Double> temp2 = trainData.map(
                    new Function<Vector, Double>() {
                        public Double call(Vector vector) throws Exception {
                            return (double) vector.length();
                        }
                    }
            );
            return (double)z.length();
        }    
    }
);
}

Currently I am getting error with this nested settings (I can post here the full log). Is it allowed in the fist place? Thanks

2条回答
迷人小祖宗
2楼-- · 2019-01-15 17:31

No, it is not possible, because the items of an RDD must be serializable and a RDD is not serializable. And this makes sense, otherwise you might transfer over the network a whole RDD which is a problem if it contains a lot of data. And if it does not contain a lot of data, you might and you should use an array or something like it.

However, I don't know how you are implementing the K-nearest neighbor...but be careful: if you do something like calculating the distance between each couple of point, this is actually not scalable in the dataset size, because it's O(n2).

查看更多
地球回转人心会变
3楼-- · 2019-01-15 17:31

I ran into nullpointer exception while trying something of this sort.As we can't perform operations on RDDs within a RDD.

Spark doesn't support nesting of RDDs the reason being - to perform an operation or create a new RDD spark runtime requires access to sparkcontext object which is available only in the driver machine.

Hence if you want to operate on nested RDDs, you may collect the parent RDD on driver node then iterate it's items using array or something.

Note:- RDD class is serializable. Please see below.

enter image description here

查看更多
登录 后发表回答