Fast way to discover the row count of a table in PostgreSQL

Posted 2020-01-24 02:28

Question:

I need to know the number of rows in a table to calculate a percentage. If the total count is greater than some predefined constant, I will use the constant value. Otherwise, I will use the actual number of rows.

I can use SELECT count(*) FROM table. But if my constant value is 500,000 and I have 5,000,000,000 rows in my table, counting all rows will waste a lot of time.

Is it possible to stop counting as soon as my constant value is surpassed?

I need the exact number of rows only as long as it's below the given limit. Otherwise, if the count is above the limit, I use the limit value instead and want the answer as fast as possible.

Something like this:

SELECT text,count(*), percentual_calculus()  
FROM token  
GROUP BY text  
ORDER BY count DESC;

Answer 1:

Counting rows in big tables is known to be slow in PostgreSQL. To get a precise number, it has to do a full count of rows due to the nature of MVCC. There is a way to speed this up dramatically if the count does not have to be exact, which seems to be the case here.

Instead of getting the exact count (slow with big tables):

SELECT count(*) AS exact_count FROM myschema.mytable;

You get a close estimate like this (extremely fast):

SELECT reltuples::bigint AS estimate FROM pg_class WHERE relname = 'mytable';

How close the estimate is depends on whether you run ANALYZE enough. It is usually very close.
See the PostgreSQL Wiki FAQ.
Or the dedicated wiki page for count(*) performance.
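If in doubt, you can refresh the statistics manually right before reading the estimate. A minimal sketch, assuming a table named mytable:

ANALYZE mytable;   -- updates reltuples (and other statistics) for the table

SELECT reltuples::bigint AS estimate
FROM   pg_class
WHERE  relname = 'mytable';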

Better yet

The article in the PostgreSQL Wiki was a bit sloppy. It ignored the possibility that there can be multiple tables of the same name in one database - in different schemas. To account for that:

SELECT c.reltuples::bigint AS estimate
FROM   pg_class c
JOIN   pg_namespace n ON n.oid = c.relnamespace
WHERE  c.relname = 'mytable'
AND    n.nspname = 'myschema';

Or better still

SELECT reltuples::bigint AS estimate
FROM   pg_class
WHERE  oid = 'myschema.mytable'::regclass;

Faster, simpler, safer, more elegant. See the manual on Object Identifier Types.

Use to_regclass('myschema.mytable') in Postgres 9.4+ to avoid exceptions for invalid table names:

  • How to check if a table exists in a given schema
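A minimal sketch of the same estimate with to_regclass (schema and table names are placeholders); an invalid name simply yields no row instead of raising an exception:

SELECT reltuples::bigint AS estimate
FROM   pg_class
WHERE  oid = to_regclass('myschema.mytable');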


TABLESAMPLE SYSTEM (n) in Postgres 9.5+

SELECT 100 * count(*) AS estimate FROM mytable TABLESAMPLE SYSTEM (1);

As @a_horse commented, the newly added clause for the SELECT command might be useful if the statistics in pg_class are not current enough for some reason. For example:

  • No autovacuum running.
  • Immediately after a big INSERT or DELETE.
  • TEMPORARY tables (which are not covered by autovacuum).

This only looks at a random n % selection of blocks (1 % in the example) and counts the rows in them. A bigger sample increases the cost and reduces the error; it's your pick. Accuracy depends on several more factors:

  • Distribution of row size. If a given block happens to hold wider-than-usual rows, the count is lower than usual, etc.
  • Dead tuples or a FILLFACTOR below 100 occupy space per block. If they are unevenly distributed across the table, the estimate may be off.
  • General rounding errors.
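If you take a bigger sample, adjust the multiplier accordingly: it has to be 100 divided by the sample percentage. A sketch with a 5 % sample (mytable is a placeholder):

SELECT (100 / 5.0) * count(*) AS estimate   -- multiplier = 100 / sample percentage
FROM   mytable TABLESAMPLE SYSTEM (5);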

In most cases the estimate from pg_class will be faster and more accurate.

Answer to actual question

First, I need to know the number of rows in that table, if the total count is greater than some predefined constant,

And whether it ...

... is possible, at the moment the count passes my constant value, to stop the counting (and not wait for the count to finish before learning that the row count is greater than the constant).

Yes. You can use a subquery with LIMIT:

SELECT count(*) FROM (SELECT 1 FROM token LIMIT 500000) t;

Postgres actually stops counting beyond the given limit; you get an exact and current count for up to n rows (500,000 in the example), and n otherwise. This is not nearly as fast as the estimate in pg_class, though.
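To tie this back to the percentage calculation in the question, here is one possible sketch. The token table and text column are taken from the question; the exact percentage formula and the 500,000 cap are assumptions:

WITH total AS (
   SELECT count(*) AS cnt
   FROM  (SELECT 1 FROM token LIMIT 500000) sub   -- stops counting at the cap
)
SELECT t.text
     , count(*) AS text_count
     , round(100.0 * count(*) / total.cnt, 2) AS percentage   -- relative to the capped total
FROM   token t, total
GROUP  BY t.text, total.cnt
ORDER  BY text_count DESC;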



Answer 2:

I did this once in a Postgres app by running:

EXPLAIN SELECT * FROM foo;

Then examining the output with a regex, or similar logic. For a simple SELECT *, the first line of output should look something like this:

Seq Scan on uids  (cost=0.00..1.21 rows=8 width=75)

You can use the rows=(\d+) value as a rough estimate of the number of rows that would be returned, then only do the actual SELECT COUNT(*) if the estimate is, say, less than 1.5x your threshold (or whatever number you deem makes sense for your application).
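One way to automate this is a small PL/pgSQL function that runs EXPLAIN and extracts the rows= figure, roughly along the lines of the count_estimate function on the PostgreSQL wiki. Treat it as a sketch rather than production code:

CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS integer AS
$$
DECLARE
    rec  record;
    rows integer;
BEGIN
    -- Run EXPLAIN on the given query and pull rows= from the top plan node
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN rows IS NOT NULL;
    END LOOP;
    RETURN rows;
END;
$$ LANGUAGE plpgsql;

-- Usage:
-- SELECT count_estimate('SELECT * FROM foo');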

Depending on the complexity of your query, this number may become less and less accurate. In fact, in my application, as we added joins and complex conditions, it became so inaccurate that it was completely worthless, even for knowing within a factor of 100 how many rows we'd have returned, so we had to abandon that strategy.

But if your query is simple enough that Pg can predict within some reasonable margin of error how many rows it will return, it may work for you.



Answer 3:

In Oracle, you could use rownum to limit the number of rows returned. I am guessing a similar construct exists in other SQL dialects as well. For the example you gave, you could limit the number of rows returned to 500,001 and then apply count(*):

SELECT (case when cnt > 500000 then 500000 else cnt end) myCnt
FROM (SELECT count(*) cnt FROM table WHERE rownum <= 500001);
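In PostgreSQL, LIMIT plays the same role as rownum here. A rough sketch of the same idea (token is the table from the question, 500,000 the assumed constant):

SELECT (CASE WHEN cnt > 500000 THEN 500000 ELSE cnt END) AS mycnt
FROM  (SELECT count(*) AS cnt
       FROM  (SELECT 1 FROM token LIMIT 500001) sub) t;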


Answer 4:

Reference taken from this blog.

You can use the queries below to find the row count.

Using pg_class:

SELECT reltuples::bigint AS EstimatedCount
FROM   pg_class
WHERE  oid = 'public.TableName'::regclass;

Using pg_stat_user_tables:

SELECT 
    schemaname
    ,relname
    ,n_live_tup AS EstimatedCount 
FROM pg_stat_user_tables 
ORDER BY n_live_tup DESC;
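To look up a single table instead of listing all of them, filter on the schema and table name (both are placeholders here):

SELECT n_live_tup AS EstimatedCount
FROM   pg_stat_user_tables
WHERE  schemaname = 'public'
AND    relname = 'tablename';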


Answer 5:

You can get the count with the query below (without * or any column names).

select from table_name;


Answer 6:

How wide is the text column?

With a GROUP BY there's not much you can do to avoid a data scan (at least an index scan).

I'd recommend:

  1. If possible, changing the schema to remove duplication of text data. This way the count will happen on a narrow foreign key field in the 'many' table.

  2. Alternatively, creating a generated column with a hash of the text, then GROUP BY the hash column. Again, this is to decrease the workload (scan through a narrow column index); see the sketch after this list.
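A minimal sketch of the second idea, assuming PostgreSQL 12+ (for generated columns) and the token table with its text column from the question; hashtextextended is used here only as one possible hash function:

ALTER TABLE token
  ADD COLUMN text_hash bigint GENERATED ALWAYS AS (hashtextextended(text, 0)) STORED;

CREATE INDEX ON token (text_hash);

-- GROUP BY the narrow hash column instead of the wide text column
SELECT text_hash, count(*) AS cnt
FROM   token
GROUP  BY text_hash
ORDER  BY cnt DESC;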

Edit:

Your original question did not quite match your edit. I'm not sure if you're aware that the COUNT, when used with a GROUP BY, will return the count of items per group and not the count of items in the entire table.



Answer 7:

For SQL Server (2005 or above) a quick and reliable method is:

SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTableName')   
AND (index_id=0 or index_id=1);

Details about sys.dm_db_partition_stats are explained on MSDN.

The query sums the row counts from all partitions of a (possibly) partitioned table.

index_id = 0 is an unordered table (heap) and index_id = 1 is an ordered table (clustered index).

Even faster (but unreliable) methods are detailed here.