SQL Server, Using UNION ALL for multiple tables th

2019-04-04 15:08发布

问题:

I need help also about paging and using UNION ALL for multiple tables:

How do i implement an optimized paging when joining multiple tables using UNION ALL and returning only specific number of rows...


declare @startRow int
declare @PageCount int

set @startRow = 0
set @PageCount = 20

set rowcount @PageCount

select Row_Number() OVER(Order by col1) as RowNumber, col1, col2
from
(
    select col1, col2 from table1 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from table2 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from table3 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from table4 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from table5 where datetimeCol between (@dateFrom and @dateTo)
) as tmpTable
where RowNumber > @startRow

table 3, 4, & 5 have huge number of row (millions of rows) where table 1 & 2 may only have few thousand rows.

If startRow is "0", I only expect data from Row 1 to 20 (from Table1). I'm getting the correct result but has a high overhead on the remaining table while sql server tries to all all the data and filter it....

the longer the interval of the @dateFrom and @dateTo makes my query significantly slower while trying to retrieve only few rows from the overall result set

Please help how i can implement a simple but better approach with a similar logic. :(

回答1:

Consider using OFFSET FETCH clause (works starting with MSSQL 2012):

declare @startRow int
declare @PageCount int

set @startRow = 0
set @PageCount = 20


select col1, col2
from
(
    select col1, col2 from table1 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from table2 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from table3 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from table4 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from table5 where datetimeCol between (@dateFrom and @dateTo)
) as tmpTable
order by col1
offset @startRow rows
fetch next @PageCount rows only

I also want to mention here, why this query always takes O(n*log(n)) time.
To execute this query, database needs to:

  1. to union multiple lists into one list - takes O(n) time for each table, where n - total number of rows in your tables;
  2. sort list by col1 - takes O(n*log(n)), where n - is total number of rows
  3. traverse the list in sorted order, skip @startRow rows, take next @PageCount rows.

As you can see, you need to union and sort all data to get the expected result (number 3).

If the performance of this query is still poor and you want to increase in, try to:

  • create clustred index based on col1 in all tables
  • create a non-clusteres index based on col1 in all tables and **inlude all other columns that you want to output in select list**.


回答2:

Instead of applying classic OFFSET based paging (note that SQL Server 2012 now natively supports it), I think your particular use case could greatly profit from a method often referred to as the "seek method" as described in this blog post here. Your query would then look like this.

select top 20 col1, col2
from
(
    select col1, col2 from t1 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from t2 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from t3 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from t4 where datetimeCol between (@dateFrom and @dateTo)
    union all
    select col1, col2 from t5 where datetimeCol between (@dateFrom and @dateTo)
) as tmpTable
where (col1 > @lastValueForCol1)
   or (col1 = @lastValueForCol1 and col2 > @lastValueForCol2)
order by col1, col2

The @lastValueForCol1 and @lastValueForCol2 values are the respective values of the last record from the previous page. This allows you to fetch the "next" page. If the ORDER BY direction is DESC, simply use < instead. If (col1, col2) is not globally unique across your tmpTable, you may need to add another column to the query and to the WHERE and ORDER BY clause to avoid losing records between pages.

With the above method, you cannot immediately jump to page 3 without having first fetched the previous 40 records. But often, you do not want to jump that far anyway. Instead, you get a much faster query that might be able to fetch data in constant time, depending on your indexing. Plus, your pages remain "stable", no matter if the underlying data changes (e.g. on page 1, while you're on page 4).

Note, the "seek method" is also called keyset paging.

Indexing

While paging with the "seek method" is always faster than when using OFFSET, you should still make sure that (col1, col2) is indexed in each one of your tables!



回答3:

Since the tables are ordered in the result set for the paging (Union ALL does not sort), there is no reason to select from all 5 tables. You should change the code to :

  • Query from Table 1. See if you have enough records.
  • If not Query from Table 2, and so on.

Managing the offset count according to the number of records queried each time. This way, you only query the tables you need.

You can even optimize by selecting the number of records in a table according to the filter to know if you need to query any data from it. So if you want records 30-50, and table1 only has 20 matching records in it, you can skip it altogether.



回答4:

select col1, col2 from table1 where datetimeCol between (@dateFrom and @dateTo)
union all
select col1, col2 from table2 where datetimeCol between (@dateFrom and @dateTo)
union all
select col1, col2 from table3 where datetimeCol between (@dateFrom and @dateTo)
union all
select col1, col2 from table4 where datetimeCol between (@dateFrom and @dateTo)
union all
select col1, col2 from table5 where datetimeCol between (@dateFrom and @dateTo)

is essentially as efficient as a normal table provided there is an index on the sorting key that you use for paging. This usually results in a query plan where all tables are merge concatenated. Merge concatenating is a streaming operation. It's cost is proportional to the number of rows drawn, not the number of rows in the tables.

Paging by row number in SQL Server always works by enumerating rows start to end until the desired window is reached. Whether the rows are drawn from a table or from multiple merged tables does not make a fundamental difference.

So a good chance to make this fast is to create a covering index keyed on col1. Unfortunately, it's not possible to index for between (@dateFrom and @dateTo) at the same time. So you have to try both indexing strategies and pick what works best.



回答5:

There maybe an issue with your database design since you have 5 similar tables. But besides this, you could materialize your UNION ALL query into a permanent table or a temp #-table with appropriate indexes on it and finally paginate over materialized data set with ROW_NUMBER() clause.