I'm getting a java.lang.OutOfMemoryError: GC overhead limit exceeded in Hive. In searching I've found that this happens when 98% of the process's CPU time is spent on garbage collection (whatever that means?). Is the core of my issue in my query? Should I be writing the query below in a different way to avoid this kind of problem?
I'm trying to count how many of a certain phone type have an active 'Use' in a given time period. Is there a way to do this logic differently so that it would run better?
SELECT COUNT(a.imei)
FROM
  (SELECT DISTINCT imei
   FROM pingdata
   WHERE timestamp BETWEEN TO_DATE("2016-06-01") AND TO_DATE("2016-07-17")
     AND (SUBSTR(imei,12,2) = "04" OR SUBSTR(imei,12,2) = "05")) a
JOIN
  (SELECT DISTINCT imei
   FROM eventdata
   WHERE timestamp BETWEEN TO_DATE("2016-06-01") AND TO_DATE("2016-07-17")
     AND event = "Use" AND clientversion LIKE '3.2%') b
ON a.imei = b.imei
Thank you
In order to improve performance, looking at your query I would partition the Hive tables by yyyy, mm, dd, or by the first two digits of imei. You will have to decide the partitioning scheme according to how you need to query these tables and the amount of data, but I would vote for yyyy, mm, dd, as that will give you a tremendous improvement in performance.
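Not from the original answer, but as a minimal sketch of what a yyyy, mm, dd-partitioned copy of pingdata could look like (the table name and any column other than imei and the timestamp are assumptions):

-- Sketch only: a day-partitioned copy of pingdata, stored as ORC.
CREATE TABLE pingdata_part (
  imei        STRING,
  `timestamp` TIMESTAMP
)
PARTITIONED BY (yyyy INT, mm INT, dd INT)
STORED AS ORC;

-- Dynamic partitioning has to be enabled before loading:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The last three SELECT columns map positionally to the partition columns.
INSERT OVERWRITE TABLE pingdata_part PARTITION (yyyy, mm, dd)
SELECT imei,
       `timestamp`,
       YEAR(`timestamp`),
       MONTH(`timestamp`),
       DAY(`timestamp`)
FROM pingdata;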
See improving-query-performance-using-partitioning. But for now, this should give you some improvements:
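The answer's rewritten query did not survive here. Purely as an illustration of the kind of single-pass join it appears to describe (the pd/ed aliases and the pd.timestamp = ed.timestamp condition are inferred from the comment below, not quoted from the answer):

-- Sketch only: join the raw tables once instead of two distinct subqueries.
SELECT COUNT(DISTINCT pd.imei)
FROM pingdata pd
JOIN eventdata ed
  ON pd.imei = ed.imei
 AND pd.`timestamp` = ed.`timestamp`   -- drop this condition if both rows always carry the same date
WHERE pd.`timestamp` BETWEEN TO_DATE("2016-06-01") AND TO_DATE("2016-07-17")
  AND SUBSTR(pd.imei, 12, 2) IN ("04", "05")
  AND ed.event = "Use"
  AND ed.clientversion LIKE '3.2%';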
If the TO_DATE(timestamp) values are inserted on the same day, in other words if both values have the same date, then the pd.timestamp = ed.timestamp condition should be excluded. Try running both queries and compare the results. Do let us know the differences and whether you find this helpful.
Everything depends on your data. If your datasets contain too many rows with the same imei, then applying distinct before the join is better than applying count(distinct) after the join (you wrote it the right way in this case). And vice versa: if there are few repeated rows with the same imei, then you can join them first. Applying distinct to each dataset before joining them is safer, though maybe slower, depending on your data.

I would recommend partitioning your datasets by the to_date(timestamp) field (yyyy-mm-dd) so that partition pruning works with your where clause (check that it does). Partition by the event field as well if the datasets are too big and contain a lot of data where event <> 'Use'.
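As a rough sketch of the partition-pruning idea, assuming the tables were rebuilt with a dt string partition column holding the yyyy-mm-dd date (that column is not in the original post):

-- Sketch only: filtering on the partition column lets Hive skip whole partitions.
SELECT DISTINCT imei
FROM pingdata
WHERE dt BETWEEN '2016-06-01' AND '2016-07-17'   -- dt is the partition column
  AND SUBSTR(imei, 12, 2) IN ('04', '05');

-- EXPLAIN DEPENDENCY lists the partitions the query will actually read,
-- which is a quick way to verify that pruning kicked in:
EXPLAIN DEPENDENCY
SELECT DISTINCT imei
FROM pingdata
WHERE dt BETWEEN '2016-06-01' AND '2016-07-17'
  AND SUBSTR(imei, 12, 2) IN ('04', '05');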
It's important to know at which stage it fails, so study the exception as well. If it fails on the mappers, then you should optimize your subqueries (add partitions as I mentioned). If it fails on the reducer (the join), then you should somehow improve the join, for example by reducing the bytes per reducer:
set hive.exec.reducers.bytes.per.reducer=67108864;
or even less. If it fails on the writer (OrcWriter), then try adding a partition on a substring of imei to the output table, and `distribute by substr(imei...)` at the end of the query to reduce the pressure on reducers. Or add a random component to distribute the data between more reducers evenly:
distribute by substr(imei...), FLOOR(RAND()*100.0)%20
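For illustration only, here is roughly where such a distribute by would sit in an INSERT ... SELECT; the output table results_by_imei and its imei_prefix partition column are hypothetical, not from the original answer:

-- Sketch only: write into a table partitioned by substr(imei, 12, 2),
-- spreading rows over more reducers so no single OrcWriter buffers too much.
INSERT OVERWRITE TABLE results_by_imei PARTITION (imei_prefix)
SELECT a.imei,
       SUBSTR(a.imei, 12, 2) AS imei_prefix
FROM pingdata a
JOIN eventdata b ON a.imei = b.imei
DISTRIBUTE BY SUBSTR(a.imei, 12, 2), FLOOR(RAND() * 100.0) % 20;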