SQL — How is DISTINCT so fast without an index?

2019-01-29 11:46发布

问题:

I have a database with a table called 'links' with 600 million rows in it in SQLite. There are 2 columns in the database - a "src" column and a "dest" column. At present there are no indices.

There are a fair number of common values between src and dest, but also a fair number of duplicated rows.

The first thing I'm trying to do is remove all the duplicate rows, and then perform some additional processing on the results, however I've been encountering some weird issues.

Firstly, SELECT * FROM links WHERE src=434923 AND dest=5010182. Now this returns one result fairly quickly and then takes quite a long time to run as I assume it's performing a tablescan on the rest of the 600m rows.

However, if I do SELECT DISTINCT * FROM links, then it immediately starts returning rows really quickly. The question is: how is this possible?? Surely for each row, the row must be compared against all of the other rows in the table, but this would require a tablescan of the remaining rows in the table which SHOULD takes ages!

Any ideas why SELECT DISTINCT is so much quicker than a standard SELECT?

回答1:

Think about it. With no ordering applied it can return results in scan order. It just keeps a list (more likely, an efficient struct like a b-tree) of the values seen so far. If a given value isn't found it's returned and added to the bookkeeping structure. Absolutely no need to compare with all the other rows at all.



回答2:

To be more accurate, one query is not quicker than the other. More precisely, the amount of time taken until the query is completed should be the same for both queries. The difference is that the query with DISTINCT simply has more rows to return therefore it appears to respond faster since you are recieving rows at a fast rate. However, what is happening under the hood of both is the same table scan. The distinct query has a data structure storing what has been returned and filters duplicates. Therefore, it SHOULD actually take longer until the query completes but (rows returned)/time is larger since there are simply more rows that match. (Also note: some viewers add a query result limit which can make the distinct query appear to run faster (since you hit the result limit and stop)).