How can I get row count from all tables using hive? I am interested in the database name, table name and row count
问题:
回答1:
You will need to do a
select count(*) from table
for all tables.
To automate this, you can make a small bash script and some bash commands. First run
$hive -e 'show tables' | tee tables.txt
This stores all tables in the database in a text file tables.txt
Create a bash file (count_tables.sh) with the following contents.
while read line
do
echo "$line "
eval "hive -e 'select count(*) from $line'"
done
Now run the following commands.
$chmod +x count_tables.sh
$./count_tables.sh < tables.txt > counts.txt
This creates a text file(counts.txt) with the counts of all the tables in the database
回答2:
A much faster way to get approximate count of all rows in a table is to run explain on the table. In one of the explain clauses, it shows row counts like below:
TableScan [TS_0] (rows=224910 width=78)
The benefit is that you are not actually spending cluster resources to get that information.
The HQL command is explain select * from table_name;
but when not optimized not shows rows in the TableScan.
回答3:
You can collect the statistics on the table by using Hive ANALAYZE command. Hive cost based optimizer makes use of these statistics to create optimal execution plan.
Below is the example of computing statistics on Hive tables:
hive> ANALYZE TABLE stud COMPUTE STATISTICS;
Query ID = impadmin_20171115185549_a73662c3-5332-42c9-bb42-d8ccf21b7221
Total jobs = 1
Launching Job 1 out of 1
…
Table training_db.stud stats: [numFiles=5, numRows=5, totalSize=50, rawDataSize=45]
OK
Time taken: 8.202 seconds
Links: http://dwgeek.com/apache-hive-explain-command-example.html/
回答4:
select count(*) from table
I think there is no more efficient way.
回答5:
try this guys to automate-- put in shell after that run bash filename.sh
hive -e 'select count(distinct fieldid) from table1 where extracttimestamp<'2018-04-26'' > sample.out
hive -e 'select count(distinct fieldid) from table2 where day='26'' > sample.out
lc=cat sample.out | uniq | wc -l
if [ $lc -eq 1 ]; then
echo "PASS"
else
echo "FAIL"
fi
回答6:
How do I mention the specific database that it needs to refer in below snippet:
while read line
do
echo "$line "
eval "hive -e 'select count(*) from $line'"
done
回答7:
Here's a solution I wrote that uses python:
import os
dictTabCnt={}
print("=====Finding Tables=====")
tableList = os.popen("hive --outputformat=dsv --showHeader=false -e \"use [YOUR DB HERE]; show tables;\"").read().split('\n')
print("=====Finding Table Counts=====")
for i in tableList:
if i <> '':
strTemp = os.popen("hive --outputformat=dsv --showHeader=false -e \"use [YOUR DB HERE]; SELECT COUNT(*) FROM {}\"".format(i)).read()
dictTabCnt[i] = strTemp
print("=====Table Counts=====")
for table,cnt in dictTabCnt.items():
print("{}: {}".format(table,cnt))
回答8:
You can also set the database in the same command and separate with ;
.
hive -e 'use myDatabase;show tables'