Slow distinct query in SQL Server over large datas

2019-03-25 06:47发布

站内文章 / SQL-Server

59 0

三岁会撩人

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

We're using SQL Server 2005 to track a fair amount of constantly incoming data (5-15 updates per second). We noticed after it has been in production for a couple months that one of the tables has started to take an obscene amount of time to query.

The table has 3 columns:

id -- autonumber (clustered)
typeUUID -- GUID generated before the insert happens; used to group the types together
typeName -- The type name (duh...)

One of the queries we run is a distinct on the typeName field:

SELECT DISTINCT [typeName] FROM [types] WITH (nolock);

The typeName field has a non-clusted, non-unique ascending index on it. The table contains approximately 200M records at the moment. When we run this query, the query took 5m 58s to return! Perhaps we're not understanding how the indexes work... But I didn't think we mis-understood them that much.

To test this a little further, we ran the following query:

SELECT DISTINCT [typeName] FROM (SELECT TOP 1000000 [typeName] FROM [types] WITH (nolock)) AS [subtbl]

This query returns in about 10 seconds, as I would expect, it's scanning the table.

Is there something we're missing here? Why does the first query take so long?

Edit: Ah, my apologies, the first query returns 76 records, thank you ninesided.

Follow up: Thank you all for your answers, it makes more sense to me now (I don't know why it didn't before...). Without an index, it's doing a table scan across 200M rows, with an index, it's doing an index scan across 200M rows...

SQL Server does prefer the index, and it does give a little bit of a performance boost, but nothing to be excited about. Rebuilding the index did take the query time down to just over 3m instead of 6m, an improvement, but not enough. I'm just going to recommend to my boss that we normalize the table structure.

Once again, thank you all for your help!!

回答1:

You do misunderstand the index. Even if it did use the index it would still do an index scan across 200M entries. This is going to take a long time, plus the time it takes to do the DISTINCT (causes a sort) and it's a bad thing to run. Seeing a DISTINCT in a query always raises a red flag and causes me to double check the query. In this case, perhaps you have a normalization issue?

回答2:

I doubt SQL Server will even try to use the index, it'd have to do practically the same amount of work (given the narrow table), reading all 200M rows regardless of whether it looks at the table or the index. If the index on typeName was clustered it may reduce the time taken as it shouldn't need to sort before grouping.

If the cardinality of your types is low, how about maintaining a summary table which holds the list of distinct type values? A trigger on insert/update of the main table would do a check on the summary table and insert a new record when a new type is found.

回答3:

There is an isse with the SQL Server optimizer when using the DISTINCT keyword. The solution was to force it to keep the same query plan by breaking out the distinct query separately.

So we too queries such as:

SELECT DISTINCT [typeName] FROM [types] WITH (nolock);

and break it up into the following

SELECT typeName INTO #tempTable1 FROM types WITH (NOLOCK)
SELECT DISTINCT typeName FROM #tempTable1

Another way to get around it is to use a GROUP BY, which gets a different optimization plan.

回答4:

As others have already pointed out - when you do a SELECT DISTINCT (typename) over your table, you'll end up with a full table scan no matter what.

So it's really a matter of limiting the number of rows that need to be scanned.

The question is: what do you need your DISTINCT typenames for? And how many of your 200M rows are distinct? Do you have only a handful (a few hundred at most) distinct typenames??

If so - you could have a separate table DISTINCT_TYPENAMES or something and fill those initially by doing a full table scan, and then on inserting new rows to the main table, just always check whether their typename is already in DISTINCT_TYPENAMES, and if not, add it.

That way, you'd have a separate, small table with just the distinct TypeName entries, which would be lightning fast to query and/or to display.

Marc

回答5:

My first thought is statistics. To find last updated:

SELECT
    name AS index_name, 
    STATS_DATE(object_id, index_id) AS statistics_update_date
FROM
    sys.indexes 
WHERE
    object_id = OBJECT_ID('MyTable');

Edit: Stats are updated when indexes are rebuilt, which I see are not maintained

My second thought is that is the index still there? The TOP query should still use an index. I've just tested on one of my tables with 57 million rows and both use the index.

回答6:

A looping approach should use multiple seeks (but loses some parallelism). It might be worth a try for cases with relatively few distinct values compared to the total number of rows (low cardinality).

Idea was from this question:

select typeName into #Result from Types where 1=0;

declare @t varchar(100) = (select min(typeName) from Types);
while @t is not null
begin
    set @t = (select top 1 typeName from Types where typeName > @t order by typeName);    
    if (@t is not null)
        insert into #Result values (@t);
end

select * from #Result;

And looks like there are also some other methods (notably the recursive CTE @Paul White):

different-ways-to-find-distinct-values-faster-methods

sqlservercentral Topic873124-338-5

回答7:

I should try something like this:

SELECT typeName FROM [types] WITH (nolock)
group by typeName;

And like other i would say you need to normalize that column.

回答8:

An index helps you quickly find a row. But you're asking the database to list all unique types for the entire table. An index can't help with that.

You could run a nightly job which runs the query and stores it in a different table. If you require up-to-date data, you could store the last ID included in the nightly scan, and combine the results:

select type
from nightlyscan
union
select distinct type
from verybigtable
where rowid > lastscannedid

Another option is to normalize the big table into two tables:

talbe1: id, guid, typeid
type table: typeid, typename

This would be very beneficial if the number of types was relatively small.

回答9:

I could be missing something but would it be more efficient if an overhead on load to create a view with distinct values and query that instead?

This would give almost instant responses to the select if the result set is significantly smaller with the overhead over populating it on each write though given the nature of the view that might be trivial in itself.

It does ask the question how many writes compared to how often you want the distinct and the importance of the speed when you do.

回答10:

An indexed view can make this faster.

create view alltypes
with schemabinding as
select typename, count_big(*) as kount
from dbo.types
group by typename

create unique clustered index idx
on alltypes (typename)

The work to keep the view up to date on each change to the base table should be moderate (depending on your application, of course -- my point is that it doesn't have to scan the whole table each time or do anything insanely expensive like that.)

Alternatively you could make a small table holding all values:

select distinct typename
into alltypes
from types

alter table alltypes
add primary key (typename)

alter table types add foreign key (typename) references alltypes

The foreign key will make sure that all values used appear in the parent alltypes table. The trouble is in ensuring that alltypes does not contain values not used in the child types table.