I execute a join using a JavaHiveContext in Spark.
The big table is 1.76 GB and has 100 million records.
The second table is 273 MB and has 10 million records.
I get a JavaSchemaRDD and call count() on it:
String query = "select attribute7,count(*) from ft,dt where ft.chiavedt=dt.chiavedt group by attribute7";
JavaSchemaRDD rdd = sqlContext.sql(query);
System.out.println("count=" + rdd.count());
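For completeness, here is roughly how that snippet fits into a full program. This is a sketch assuming the Spark 1.x Java API (JavaHiveContext, JavaSchemaRDD); the class name, app name, and context setup are illustrative, not my exact code:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.api.java.JavaSchemaRDD;
import org.apache.spark.sql.hive.api.java.JavaHiveContext;

public class JoinCount {
    public static void main(String[] args) {
        // Illustrative context setup; the app name is a placeholder
        SparkConf conf = new SparkConf().setAppName("ft-dt-join");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaHiveContext sqlContext = new JavaHiveContext(sc);

        String query = "select attribute7,count(*) from ft,dt "
                + "where ft.chiavedt=dt.chiavedt group by attribute7";
        JavaSchemaRDD rdd = sqlContext.sql(query);
        // count() triggers execution of the join and the aggregation
        System.out.println("count=" + rdd.count());
    }
}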
If I force a BroadcastHashJoin (SET spark.sql.autoBroadcastJoinThreshold=290000000)
and use 5 executors on 5 nodes, each with 8 cores and 20 GB of memory, the query executes in 100 seconds.
If I don't force the broadcast, it executes in 30 seconds.
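Concretely, I raise the threshold from the same context before running the query; the SET statement is the one from above, while the spark-submit flags shown in the comment are my assumption about how the resources were requested (the standard YARN flags):

// 290000000 bytes is above dt's size (273 MB), so the planner
// should pick a BroadcastHashJoin and broadcast dt to every executor
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=290000000");

// Resource allocation for the test (illustrative spark-submit flags):
// spark-submit --num-executors 5 --executor-cores 8 --executor-memory 20g ...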
N.B.: the tables are stored as Parquet files.
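For reference, a sketch of one way such Parquet tables could be registered before the query runs (the HDFS paths are hypothetical, and registerTempTable assumes Spark 1.1+, where it replaced registerAsTable; with a JavaHiveContext they could equally be Hive tables stored as Parquet):

// Hypothetical paths; ft and dt are read from Parquet and registered
JavaSchemaRDD ft = sqlContext.parquetFile("hdfs:///data/ft.parquet");
ft.registerTempTable("ft");
JavaSchemaRDD dt = sqlContext.parquetFile("hdfs:///data/dt.parquet");
dt.registerTempTable("dt");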