I have a 1.6 TB Hive table with time series data. I am using Hive 1.2.1 and Spark 1.6.1 with Scala.
Following is the query I have in my code, but it always fails with a Java out-of-memory error:
val sid_data_df = hiveContext.sql(s"SELECT time, total_field, sid, year, date FROM tablename WHERE sid = '$stationId' ORDER BY time LIMIT 4320000 ")
By iteratively selecting a few records at a time from the Hive table, I am trying to do a sliding window on the resultant DataFrame.
I have a cluster of 4 nodes, each with 122 GB of memory and 44 vCores. I am using 425 GB of the 488 GB of memory available. I am launching spark-submit with the following parameters:
--num-executors 16 --driver-memory 4g --executor-memory 22G --executor-cores 10 \
--conf "spark.sql.shuffle.partitions=1800" \
--conf "spark.shuffle.memory.fraction=0.6" \
--conf "spark.storage.memoryFraction=0.4" \
--conf "spark.yarn.executor.memoryOverhead=2600" \
--conf "spark.yarn.nodemanager.resource.memory-mb=123880" \
--conf "spark.yarn.nodemanager.resource.cpu-vcores=43"
Kindly give me suggestions on how to optimize this and successfully fetch the data from the Hive table. Thanks.
The problem is likely the ORDER BY time LIMIT 4320000 clause in your query. You should avoid using LIMIT to subset a large number of records. In Spark, LIMIT moves all rows to a single partition and is likely to cause serious performance and stability issues. See for example How to optimize below spark code (scala)?
This doesn't sound right. Sliding window operations can usually be achieved with some combination of window functions and timestamp-based window buckets.
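For the sliding window itself, a window function over the time column is the usual approach; in Spark 1.6 this requires a HiveContext, which you already use. A minimal sketch, assuming the sid_data_df from your query and a numeric total_field (the 3-row frame is only an example):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

// Sliding average per station over the current row and the 3 preceding rows,
// computed within each partition rather than on a single node.
val w = Window
  .partitionBy("sid")
  .orderBy("time")
  .rowsBetween(-3, 0)

val slidingAvg = sid_data_df.withColumn("total_field_avg", avg("total_field").over(w))

This keeps the computation distributed and avoids collecting millions of rows into one partition just to slice them afterwards.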