Is there a Hive query to quickly find table size (i.e. number of rows) without launching a time-consuming MapReduce job? (Which is why I want to avoid COUNT(*)
.)
I tried DESCRIBE EXTENDED
, but that yielded numRows=0
which is obviously not correct.
(Apologies for the newb question. I tried Googling and searching the apache.org documentation without success.)
It is a good question. the count() will take much time for finding the result. But unfortunately, count() is the only way to do.
There is an alternative way(can't say alternate but better latency than above case) :
Set the property
set hive.exec.mode.local.auto=true;
and run the same command (
select count(*) from tbl
) which gives better latency than prior.Use parquet format to store data of your external/internal table. Then you will get quicker results.
Here is the quick command
For Example,If table is partitioned
output is
Partition logdata.ops_bc_log{day=20140523} stats: [numFiles=37, numRows=26095186, totalSize=654249957, rawDataSize=58080809507]
Partition logdata.ops_bc_log{day=20140521} stats: [numFiles=30, numRows=21363807, totalSize=564014889, rawDataSize=47556570705]
Partition logdata.ops_bc_log{day=20140524} stats: [numFiles=35, numRows=25210367, totalSize=631424507, rawDataSize=56083164109]
Partition logdata.ops_bc_log{day=20140522} stats: [numFiles=37, numRows=26295075, totalSize=657113440, rawDataSize=58496087068]
OK
Time taken: 5.252 seconds
tblproperties will give the size of the table and can be used to grab just that value if needed.
How about using :
solution, though not quick
if the table is partitioned, we can count the number of partitions and count(number of rows) in each partition.
For example:, if partition by date (mm-dd-yyyy)