I want to use Array[Byte] as the key of an RDD. For example:
val rdd1: RDD[(Array[Byte], (String, Int))] = ??? // built from the source RDD
val rdd2: RDD[(Array[Byte], (String, Int))] = ??? // built from the destination RDD
val resultRdd = rdd1.join(rdd2)
I want to join rdd1 and rdd2 using Array[Byte] as the key, but resultRdd.count() always returns 0.
So I wrapped the Array[Byte] in a Serializable class. With that, the join works as expected, but only for a small dataset.
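The empty result is consistent with how JVM arrays behave: two arrays with the same contents are still different objects, and `==` and `hashCode` on arrays are based on identity rather than contents. A minimal sketch of that behaviour, independent of Spark (the object and method names here are only illustrative):

```scala
import java.util.Arrays

object ArrayKeyDemo {
  // Scala's == on arrays is reference equality, not a content comparison.
  def sameByReference(a: Array[Byte], b: Array[Byte]): Boolean = a == b

  // java.util.Arrays compares and hashes the contents of primitive arrays.
  def sameByContent(a: Array[Byte], b: Array[Byte]): Boolean =
    Arrays.equals(a, b) && Arrays.hashCode(a) == Arrays.hashCode(b)

  def main(args: Array[String]): Unit = {
    val a = Array[Byte](1, 2, 3)
    val b = Array[Byte](1, 2, 3)
    println(sameByReference(a, b)) // false
    println(sameByContent(a, b))   // true
  }
}
```

If the join compares keys the first way, keys with identical bytes never match, which would explain a count of 0.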
val serRdd1 = rdd1.map { case (k, v) => (new SerByteArr(k), v) }
val serRdd2 = rdd2.map { case (k, v) => (new SerByteArr(k), v) }
class SerByteArr(val bytes: Array[Byte]) extends Serializable {
  // Hash and compare by array contents rather than by reference.
  override val hashCode: Int = bytes.deep.hashCode
  override def equals(obj: Any): Boolean = obj match {
    case other: SerByteArr => other.bytes.deep == this.bytes.deep
    case _                 => false
  }
}
For a large dataset, I get java.lang.OutOfMemoryError: GC overhead limit exceeded. The problem seems to occur while creating the wrapper objects (new SerByteArr(k)).
How can I avoid the GC overhead limit error? Can anyone help me?
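One variant of the wrapper that might be worth comparing is sketched below. It hashes and compares the primitive array directly with `java.util.Arrays` instead of going through `.deep`, and it does not store the hash in an extra field. The class name `ByteArrayKey` is hypothetical, and whether this actually lowers GC pressure on the large dataset is an untested assumption:

```scala
import java.util.Arrays

// Hypothetical replacement for SerByteArr; the name is illustrative only.
final class ByteArrayKey(val bytes: Array[Byte]) extends Serializable {
  // Content-based hash computed directly over the primitive array on each call.
  override def hashCode: Int = Arrays.hashCode(bytes)

  // Two keys are equal when their byte contents are equal.
  override def equals(obj: Any): Boolean = obj match {
    case other: ByteArrayKey => Arrays.equals(bytes, other.bytes)
    case _                   => false
  }
}
```

It would be used the same way as the original wrapper, for example `rdd1.map { case (k, v) => (new ByteArrayKey(k), v) }`.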