More efficient query to avoid OutOfMemoryError in Hive

Published 2019-02-25 22:15

Question:

I'm getting a java.lang.OutOfMemoryError: GC overhead limit exceeded in Hive. From searching I've found that this error means roughly 98% of the process's CPU time is going to garbage collection (whatever that means?). Is the core of my issue in my query? Should I be writing the query below in a different way to avoid this kind of problem?

I'm trying to count how many phones of a certain type have an active 'Use' event in a given time period. Is there a way to do this logic differently that would run better?

select count(a.imei)
from
  (select distinct imei
   from pingdata
   where timestamp between TO_DATE("2016-06-01") and TO_DATE("2016-07-17")
     and (SUBSTR(imei,12,2) = "04" or SUBSTR(imei,12,2) = "05")) a
join
  (select distinct imei
   from eventdata
   where timestamp between TO_DATE("2016-06-01") and TO_DATE("2016-07-17")
     and event = "Use" and clientversion like '3.2%') b
on a.imei = b.imei

Thank you

Answer 1:

Everything depends on your data. If your datasets contain many rows with the same imei, then applying distinct before the join is better than applying count(distinct) after the join (you wrote it the right way in this case). And vice versa: if there are few repeated rows with the same imei, then you can join first and de-duplicate afterwards. Applying distinct to each dataset before joining them is safer, though maybe slower, depending on your data.

I would recommend partitioning your datasets by a to_date(timestamp) field (yyyy-MM-dd) so that partition pruning works with your where clause (check that it does). Also partition by the event field if the datasets are very big and contain a lot of rows where event <> 'Use'.
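As a concrete sketch of that partitioning advice (the table name, column layout, and partition keys here are assumptions, not the asker's actual schema), the event table could be laid out so the where clause prunes whole date and event partitions:

```sql
-- Sketch: a date- and event-partitioned copy of eventdata.
-- A query filtering on dt and event then reads only the matching
-- partition directories instead of scanning the whole table.
create table eventdata_part (
  imei string,
  clientversion string,
  `timestamp` timestamp
)
partitioned by (dt string, event string)
stored as orc;

-- Partition pruning applies because dt and event are partition columns:
select distinct imei
from eventdata_part
where dt between '2016-06-01' and '2016-07-17'
  and event = 'Use'
  and clientversion like '3.2%';
```

With this layout the distinct-per-imei subquery only ever touches the 2016-06-01 through 2016-07-17 directories of the 'Use' partition.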

It's important to know at which stage it fails. Study the exception as well.

If it fails on the mappers, then you should optimize your subqueries (add partitions as I mentioned). If it fails on the reducer (the join), then you should somehow improve the join, for example by reducing the bytes per reducer:

set hive.exec.reducers.bytes.per.reducer=67108864; -- or even less

If it fails on the writer (OrcWriter), then try partitioning the output table by a substring of imei, and add distribute by substr(imei...) at the end of the query to reduce pressure on the reducers.

Or add a random component to distribute the data between more reducers evenly:

distribute by substr(imei...), FLOOR(RAND()*100.0)%20
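Putting those last two pieces together, a write into a partitioned output table might look like the sketch below. The table and column names are assumptions, and the 20-bucket random split is an illustrative value to tune; the point is that adding a random term to distribute by spreads the rows of any one partition across several reducers instead of funnelling them through a single overloaded one:

```sql
-- Sketch (assumed names): load eventdata into a date-partitioned table,
-- spreading each date's rows over up to 20 reducers so no single
-- reducer/writer holds a whole day's data in memory.
insert overwrite table eventdata_part partition (dt)
select imei, event, clientversion, `timestamp`,
       to_date(`timestamp`) as dt
from eventdata
distribute by to_date(`timestamp`), FLOOR(RAND() * 100.0) % 20;
```

Without the random term, every row of a given date goes to one reducer, which is exactly the skew that makes an OrcWriter run out of memory.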



Answer 2:

In order to improve performance, looking at your query: I would partition the Hive tables by yyyy, mm, dd, or by the first two digits of imei. You will have to choose according to how you query these tables and the amount of data, but I would vote for yyyy, mm, dd, which should give you a tremendous improvement in performance. See improving-query-performance-using-partitioning.
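A minimal sketch of the yyyy, mm, dd layout this suggests (table and column names are assumptions):

```sql
-- Sketch: pingdata partitioned by year/month/day. A date-range filter
-- on the partition columns then skips all directories outside the range.
create table pingdata_part (
  imei string,
  `timestamp` timestamp
)
partitioned by (yyyy string, mm string, dd string)
stored as orc;
```

Queries would then filter on yyyy/mm/dd (the partition columns) rather than on a function of the raw timestamp, since wrapping a column in a function generally defeats partition pruning.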

But for now, this should give you some improvement:

Select count(distinct pd.imei)
from pingdata pd join eventdata ed on pd.imei=ed.imei
where
TO_DATE(pd.timestamp) between '2016-06-01' AND '2016-07-17'
and pd.timestamp=ed.timestamp
and SUBSTR(pd.imei,12,2) in ('04','05')
and ed.event = 'Use' AND ed.clientversion like '3.2%';

If the timestamps in the two tables only agree at the date level (in other words, the full timestamp values are not guaranteed to match exactly), then the pd.timestamp=ed.timestamp condition should be excluded:

Select count(distinct pd.imei)
from pingdata pd join eventdata ed on pd.imei=ed.imei
where
TO_DATE(pd.timestamp) between '2016-06-01' AND '2016-07-17'
and SUBSTR(pd.imei,12,2) in ('04','05')
and ed.event = 'Use' AND ed.clientversion like '3.2%';

Try running both queries and comparing the results. Do let us know the differences, and whether you find this helpful.