SQLite: COUNT slow on big tables

2020-05-24 20:07发布

I'm having a performance problem in SQLite with a SELECT COUNT(*) on a large tables.

As I didn't yet receive a usable answer and I did some further testing, I edited my question to incorporate my new findings.

I have 2 tables:

CREATE TABLE Table1 (
Key INTEGER NOT NULL,
... several other fields ...,
Status CHAR(1) NOT NULL,
Selection VARCHAR NULL,
CONSTRAINT PK_Table1 PRIMARY KEY (Key ASC))

CREATE Table2 (
Key INTEGER NOT NULL,
Key2 INTEGER NOT NULL,
... a few other fields ...,
CONSTRAINT PK_Table2 PRIMARY KEY (Key ASC, Key2 ASC))

Table1 has around 8 million records and Table2 has around 51 million records, and the databasefile is over 5GB.

Table1 has 2 more indexes:

CREATE INDEX IDX_Table1_Status ON Table1 (Status ASC, Key ASC)
CREATE INDEX IDX_Table1_Selection ON Table1 (Selection ASC, Key ASC)

"Status" is required field, but has only 6 distinct values, "Selection" is not required and has only around 1.5 million values different from null and only around 600k distinct values.

I did some tests on both tables, you can see the timings below, and I added the "explain query plan" for each request (QP). I placed the database file on an USB-memorystick so i could remove it after each test and get reliable results without interference of the disk cache. Some requests are faster on USB (I suppose due to lack of seektime), but some are slower (table scans).

SELECT COUNT(*) FROM Table1
    Time: 105 sec
    QP: SCAN TABLE Table1 USING COVERING INDEX IDX_Table1_Selection(~1000000 rows)
SELECT COUNT(Key) FROM Table1
    Time: 153 sec
    QP: SCAN TABLE Table1 (~1000000 rows)
SELECT * FROM Table1 WHERE Key = 5123456
    Time: 5 ms
    QP: SEARCH TABLE Table1 USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
SELECT * FROM Table1 WHERE Status = 73 AND Key > 5123456 LIMIT 1
    Time: 16 sec
    QP: SEARCH TABLE Table1 USING INDEX IDX_Table1_Status (Status=?) (~3 rows)
SELECT * FROM Table1 WHERE Selection = 'SomeValue' AND Key > 5123456 LIMIT 1
    Time: 9 ms
    QP: SEARCH TABLE Table1 USING INDEX IDX_Table1_Selection (Selection=?) (~3 rows)

As you can see the counts are very slow, but normal selects are fast (except for the 2nd one, which took 16 seconds).

The same goes for Table2:

SELECT COUNT(*) FROM Table2
    Time: 528 sec
    QP: SCAN TABLE Table2 USING COVERING INDEX sqlite_autoindex_Table2_1(~1000000 rows)
SELECT COUNT(Key) FROM Table2
    Time: 249 sec
    QP: SCAN TABLE Table2 (~1000000 rows)
SELECT * FROM Table2 WHERE Key = 5123456 AND Key2 = 0
    Time: 7 ms
    QP: SEARCH TABLE Table2 USING INDEX sqlite_autoindex_Table2_1 (Key=? AND Key2=?) (~1 rows)

Why is SQLite not using the automatically created index on the primary key on Table1 ? And why, when he uses the auto-index on Table2, it still takes a lot of time ?

I created the same tables with the same content and indexes on SQL Server 2008 R2 and there the counts are nearly instantaneous.

One of the comments below suggested executing ANALYZE on the database. I did and it took 11 minutes to complete. After that, I ran some of the tests again:

SELECT COUNT(*) FROM Table1
    Time: 104 sec
    QP: SCAN TABLE Table1 USING COVERING INDEX IDX_Table1_Selection(~7848023 rows)
SELECT COUNT(Key) FROM Table1
    Time: 151 sec
    QP: SCAN TABLE Table1 (~7848023 rows)
SELECT * FROM Table1 WHERE Status = 73 AND Key > 5123456 LIMIT 1
    Time: 5 ms
    QP: SEARCH TABLE Table1 USING INTEGER PRIMARY KEY (rowid>?) (~196200 rows)
SELECT COUNT(*) FROM Table2
    Time: 529 sec
    QP: SCAN TABLE Table2 USING COVERING INDEX sqlite_autoindex_Table2_1(~51152542 rows)
SELECT COUNT(Key) FROM Table2
    Time: 249 sec
    QP: SCAN TABLE Table2 (~51152542 rows)

As you can see, the queries took the same time (except the query plan is now showing the real number of rows), only the slower select is now also fast.

Next, I create dan extra index on the Key field of Table1, which should correspond to the auto-index. I did this on the original database, without the ANALYZE data. It took over 23 minutes to create this index (remember, this is on an USB-stick).

CREATE INDEX IDX_Table1_Key ON Table1 (Key ASC)

Then I ran the tests again:

SELECT COUNT(*) FROM Table1
    Time: 4 sec
    QP: SCAN TABLE Table1 USING COVERING INDEX IDX_Table1_Key(~1000000 rows)
SELECT COUNT(Key) FROM Table1
    Time: 167 sec
    QP: SCAN TABLE Table2 (~1000000 rows)
SELECT * FROM Table1 WHERE Status = 73 AND Key > 5123456 LIMIT 1
    Time: 17 sec
    QP: SEARCH TABLE Table1 USING INDEX IDX_Table1_Status (Status=?) (~3 rows)

As you can see, the index helped with the count(*), but not with the count(Key).

Finaly, I created the table using a column constraint instead of a table constraint:

CREATE TABLE Table1 (
Key INTEGER PRIMARY KEY ASC NOT NULL,
... several other fields ...,
Status CHAR(1) NOT NULL,
Selection VARCHAR NULL)

Then I ran the tests again:

SELECT COUNT(*) FROM Table1
    Time: 6 sec
    QP: SCAN TABLE Table1 USING COVERING INDEX IDX_Table1_Selection(~1000000 rows)
SELECT COUNT(Key) FROM Table1
    Time: 28 sec
    QP: SCAN TABLE Table1 (~1000000 rows)
SELECT * FROM Table1 WHERE Status = 73 AND Key > 5123456 LIMIT 1
    Time: 10 sec
    QP: SEARCH TABLE Table1 USING INDEX IDX_Table1_Status (Status=?) (~3 rows)

Although the query plans are the same, the times are a lot better. Why is this ?

The problem is that ALTER TABLE does not permit to convert an existing table and I have a lot of existing databases which i can not convert to this form. Besides, using a column contraint instead of table constraint won't work for Table2.

Has anyone any idea what I am doing wrong and how to solve this problem ?

I used System.Data.SQLite version 1.0.74.0 to create the tables and to run the tests I used SQLiteSpy 1.9.1.

Thanks,

Marc

8条回答
爷、活的狠高调
2楼-- · 2020-05-24 20:22

From http://old.nabble.com/count(*)-slow-td869876.html

SQLite always does a full table scan for count(*). It does not keep meta information on tables to speed this process up.

Not keeping meta information is a deliberate design decision. If each table stored a count (or better, each node of the B-tree stored a count) then much more updating would have to occur on every INSERT or DELETE. This would slow down INSERT and DELETE, even in the common case where count(*) speed is unimportant.

If you really need a fast COUNT, then you can create a trigger on INSERT and DELETE that updates a running count in a separate table then query that separate table to find the latest count.

Of course, it's not worth keeping a FULL row count if you need COUNTs dependent on WHERE clauses (i.e. WHERE field1 > 0 and field2 < 1000000000).

查看更多
我只想做你的唯一
3楼-- · 2020-05-24 20:31

Do not count the stars, count the records! Or in other language, never issue

SELECT COUNT(*) FROM tablename;

use

SELECT COUNT(ROWID) FROM tablename;

Call EXPLAIN QUERY PLAN for both to see the difference. Make sure you have an index in place containing all columns mentioned in the WHERE clause.

查看更多
做自己的国王
4楼-- · 2020-05-24 20:32

I had the same problem, in my situation VACUUM command helped. After its execution on database COUNT(*) speed increased near 100 times. However, command itself needs some minutes in my database (20 millions records). I solved this problem by running VACUUM when my software exits after main window destruction, so the delay doesn't make problems to user.

查看更多
Juvenile、少年°
5楼-- · 2020-05-24 20:33

On the matter of the column constraint, SQLite maps columns that are declared to be INTEGER PRIMARY KEY to the internal row id (which in turn admits a number of internal optimizations). Theoretically, it could do the same for a separately-declared primary key constraint, but it appears not to do so in practice, at least with the version of SQLite in use. (System.Data.SQLite 1.0.74.0 corresponds to core SQLite 3.7.7.1. You might want to try re-checking your figures with 1.0.79.0; you shouldn't need to change your database to do that, just the library.)

查看更多
够拽才男人
6楼-- · 2020-05-24 20:36

Here's a potential workaround to improve the query performance. From the context, it sounds like your query takes about a minute and a half to run.

Assuming you have a date_created column (or can add one), run a query in the background each day at midnight (say at 00:05am) and persist the value somewhere along with the last_updated date it was calculated (I'll come back to that in a bit).

Then, running against your date_created column (with an index), you can avoid a full table scan by doing a query like SELECT COUNT(*) FROM TABLE WHERE date_updated > "[TODAY] 00:00:05".

Add the count value from that query to your persisted value, and you have a reasonably fast count that's generally accurate.

The only catch is that from 12:05am to 12:07am (the duration during which your total count query is running) you have a race condition which you can check the last_updated value of your full table scan count(). If it's > 24 hours old, then your incremental count query needs to pull a full day's count plus time elapsed today. If it's < 24 hours old, then your incremental count query needs to pull a partial day's count (just time elapsed today).

查看更多
贪生不怕死
7楼-- · 2020-05-24 20:38

This may not help much, but you can run the ANALYZE command to rebuild statistics about your database. Try running "ANALYZE;" to rebuild statistics about the entire database, then run your query again and see if it is any faster.

查看更多
登录 后发表回答