Select distinct … inner join vs. select … where id

2019-02-24 22:09发布

问题:

I'm trying to create a subset of a table (as a materialized view), defined as those records which have a matching record in another materialized view.

For example, let's say I have a Users table with user_id and name columns, and a Log table, with entry_id, user_id, activity, and timestamp columns.

First I create a materialized view of the Log table, selecting only those rows with timestamp > some_date. Now I want a materliazed view of the Users referenced in my snapshot of the Log table. I can either create it as

select * from Users where user_id in (select user_id from Log_mview)

or I can do

select distinct u.* from Users u inner join Log_mview l on u.user_id = l.user_id

(need the distinct to avoid multiple hits from users with multiple log entries).

The former seems cleaner and more elegant, but takes much longer. Am I missing something? Is there a better way to do this?

Edit: The where exists clause helped a lot, except in the case where the condition uses an OR. For example, let's say the Log table above also had a user_name column, and the correct way to match a Log entry to a Users record is when either of the columns (user id or user name) match. I'm finding that

select distinct u.* from Users u
    inner join Log_mview l
        on u.user_id = l.user_id or u.name = l.user_name

is much faster than

select * from Users u where exists
    (select id from Log_mview l 
        where l.user_id = u.user_id or l.user_name = u.name)

Any help?

(Regarding the explain plan... Lemme work on sanitizing it, or them, rather... I'll post them in a while.)

Edit: explain plans: For the query with inner join:

Plan hash value: 436698422

---------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name                | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |                     |  4539K|   606M|       |   637K  (3)| 02:07:25 |
|   1 |  HASH UNIQUE                    |                     |  4539K|   606M|  3201M|   637K  (3)| 02:07:25 |
|   2 |   CONCATENATION                 |                     |       |       |       |            |          |
|*  3 |    HASH JOIN                    |                     |  4206K|   561M|    33M|   181K  (4)| 00:36:14 |
|   4 |     BITMAP CONVERSION TO ROWIDS |                     |   926K|    22M|       |  2279   (1)| 00:00:28 |
|   5 |      BITMAP INDEX FAST FULL SCAN| I_M_LOG_MVIEW_4     |       |       |       |            |          |
|*  6 |     TABLE ACCESS FULL           | USERS               |    15M|  1630M|       | 86638   (6)| 00:17:20 |
|*  7 |    HASH JOIN                    |                     |  7646K|  1020M|    33M|   231K  (4)| 00:46:13 |
|   8 |     BITMAP CONVERSION TO ROWIDS |                     |   926K|    22M|       |  2279   (1)| 00:00:28 |
|   9 |      BITMAP INDEX FAST FULL SCAN| I_M_LOG_MVIEW_4     |       |       |       |            |          |
|  10 |     TABLE ACCESS FULL           | USERS               |    23M|  2515M|       | 87546   (7)| 00:17:31 |
---------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("U"."NAME"="L"."USER_NAME")
   6 - filter("U"."NAME" IS NOT NULL)
   7 - access("U"."USER_ID"=TO_NUMBER("L"."USER_ID"))
       filter(LNNVL("U"."NAME"="L"."USER_NAME") OR LNNVL("U"."NAME" IS NOT NULL))

Note
-----
   - dynamic sampling used for this statement

For the one using where exists:

Plan hash value: 2786958565

-----------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name                | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                     |     1 |   114 |    21M  (1)| 70:12:13 |
|*  1 |  FILTER                       |                     |       |       |            |          |
|   2 |   TABLE ACCESS FULL           | USERS               |    23M|  2515M| 87681   (7)| 00:17:33 |
|   3 |   BITMAP CONVERSION TO ROWIDS |                     |  7062 |   179K|     1   (0)| 00:00:01 |
|*  4 |    BITMAP INDEX FAST FULL SCAN| I_M_LOG_MVIEW_4     |       |       |            |          |
-----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter( EXISTS (SELECT /*+ */ 0 FROM "MYSCHEMA"."LOG_MVIEW" 
              "LOG_MVIEW" WHERE ("USER_NAME"=:B1 OR TO_NUMBER("USER_ID")=:B2) AND 
              ("USER_NAME"=:B3 OR TO_NUMBER("USER_ID")=:B4) AND ("USER_NAME"=:B5 OR 
              TO_NUMBER("USER_ID")=:B6)))
   4 - filter("USER_NAME"=:B1 OR TO_NUMBER("USER_ID")=:B2)

Note
-----
   - dynamic sampling used for this statement

DB object names changed to protect the innocent. :p

回答1:

Try this

select * from Users u
where exists 
   ( select user_id 
     from Log_mview l
     where l.user_id = u.user_id )
/

If the sub-query returns a large number of rows WHERE EXISTS can be substantially faster than WHERE ... IN.



回答2:

This will depend on the data you have, but using Distinct within the join could improve your performance:

Select u.*
From Users u
Join ( Select Distinct user_id
       From log_mview ) l On u.user_id = l.user_id


回答3:

The second query is probably working more the harddrive than the first query (join+distinc).

The first query will probably translates to something like:

for each row in table Log find corresponding row in table User (in memory).

The database is probably smart enough to create in memory structures for table User that is probably much smaller than Log table.

I believe that query one (join+distinct) will require only one pass on table Log.

The distinct is probably executed in memory.

The second query probably forces the database to do multiples fulls reads on table Log.

So in the second query you probably get:

For each row in table user read all the rows in table Log (from disk) in order to match the condition.

You have also to consider that some query may experience a dramatic diference in speed due to changes in memory availability, load and table increase.