I have a 1.6 TB Hive table with time series data. I am using Hive 1.2.1 and Spark 1.6.1 with Scala.
Following is the query that I have in my code, but it always fails with a Java out-of-memory error:
val sid_data_df = hiveContext.sql(s"SELECT time, total_field, sid, year, date FROM tablename WHERE sid = '$stationId' ORDER BY time LIMIT 4320000 ")
By iteratively selecting a few records at a time from the Hive table, I am trying to run a sliding window over the resulting DataFrame, roughly as sketched below.
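The per-station step looks roughly like the following sketch. The station id list, the window length, the moving-average aggregate and the output path are just placeholders for my actual computation; the column names and query are the same as above.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Hypothetical list of station ids; in my real code this comes from the driver.
val stationIds = Seq("S001", "S002")

// Placeholder window length for the sliding window.
val windowSize = 10
val w = Window.partitionBy("sid").orderBy("time").rowsBetween(-(windowSize - 1), 0)

for (stationId <- stationIds) {
  // Pull one station's rows at a time, same query as above.
  val sid_data_df = hiveContext.sql(
    s"SELECT time, total_field, sid, year, date FROM tablename " +
    s"WHERE sid = '$stationId' ORDER BY time LIMIT 4320000")

  // Example sliding aggregate; my real computation is more involved.
  val windowed = sid_data_df.withColumn("moving_avg", avg(col("total_field")).over(w))
  windowed.write.mode("overwrite").parquet(s"/tmp/windowed/$stationId")  // placeholder sink
}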
I have a cluster of 4 nodes, each with 122 GB of memory and 44 vCores. I am using 425 GB of the 488 GB of available memory. I launch spark-submit with the following parameters:
--num-executors 16 --driver-memory 4g --executor-memory 22G --executor-cores 10 \
--conf "spark.sql.shuffle.partitions=1800" \
--conf "spark.shuffle.memory.fraction=0.6" \
--conf "spark.storage.memoryFraction=0.4" \
--conf "spark.yarn.executor.memoryOverhead=2600" \
--conf "spark.yarn.nodemanager.resource.memory-mb=123880" \
--conf "spark.yarn.nodemanager.resource.cpu-vcores=43"
Kindly give me suggestions on how to optimize this and successfully fetch the data from the Hive table.
Thanks