NOT IN query… odd results

2019-07-07 06:50发布

问题:

I need a list of users in one database that are not listed as the new_user_id in another. There are 112,815 matching users in both databases; user_id is the key in all queries tables.


Query #1 works, and gives me 111,327 users who are NOT referenced as a new_user_Id. But it requires querying the same data twice.

-- 111,327 GSU users are NOT listed as a CSS new user 
--   1,488 GSU users ARE listed as a new user in CSS
--
select count(gup.user_id)
from   gsu.user_profile gup
  join (select cud.user_id, cud.new_user_id, cud.user_type_code
        from   css.user_desc cud) cudsubq
    on gup.user_id = cudsubq.user_id
where  gup.user_id not in (select cud.new_user_id 
                           from   css.user_desc cud
                           where  cud.new_user_id is not null);


Query #2 would be perfect... and I'm actually surprised that it's syntactically accepted. But it gives me a result that makes no sense.

-- This gives me 1,505 users... I've checked, and they are not
-- referenced as new_user_ids in CSS, but I don't know why the ones 
-- that were excluded were excluded.
--
-- Where are the missing 109,822, and whatexcluded them?
-- 
select count(gup.user_id)
from   gsu.user_profile gup
  join (select cud.user_id, cud.new_user_id, cud.user_type_code
        from   css.user_desc cud) cudsubq
    on gup.user_id = cudsubq.user_id
where  gup.user_id not in (cudsubq.new_user_id);


What exactly is the where clause in the second query doing, and why is it excluding 109,822 records from the results?


Note The above query is a simplification of what I'm really after. There are other/better ways to do the above queries... they're just representative of the part of the query that's giving me problems.

回答1:

Read this: http://asktom.oracle.com/pls/asktom/f?p=100:11:0::NO::P11_QUESTION_ID:442029737684

For what I understand, your cudsubq.new_user_id can be NULL even though both tables are joined by user_id, so, you won't get results using the NOT IN operator when the subset contains NULL values . Consider the example in the article:

select * from dual where dummy not in ( NULL )

This returns no records. Try using the NOT EXISTS operator or just another kind of join. Here is a good source: http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html

And what you need is the fourth example:

SELECT COUNT(descr.user_id)
FROM 
    user_profile prof
    LEFT OUTER JOIN user_desc descr
        ON prof.user_id = descr.user_id 
WHERE descr.new_user_id IS NULL
    OR descr.new_user_id != prof.user_id


回答2:

Second query is semantically different. In this case

where  gup.user_id not in (cudsubq.new_user_id)

cudsubq.new_user_id is treated as expression (doc: IN condition), not as a subquery, thus the whole clause is basically equivalent to

where  gup.user_id != cudsubq.new_user_id

So, in your first query, you're literally asking "show me all users in GUP, who also have entries in CSS and their GUP.ID is not matching ANY NOT NULL NEW_ID in CSS ".

However, the second query is "show me all users in GUP, who also have entries in CSS and their GUP.ID is not equal to their RESPECTIVE NULLABLE (no is not null clause, remember?) CSS.NEW_ID value".

And any (not) in (or equality/inequality) checks with nulls don't actually work.

12:07:54 SYSTEM@oars_sandbox> select * from dual where 1 not in (null, 2, 3, 4);

no rows selected                                                   

Elapsed: 00:00:00.00          

This is where you lose your rows. I would probably rewrite your second query's where clause as where cudsubq.new_user_id is null, assuming that non-matching users have null new_user_id.



回答3:

Your second select compares gup.user_id with cud.new_user_id on current joining record. You can rewrite the query to get the same result

select count(gup.user_id)
from   gsu.user_profile gup
  join (select cud.user_id, cud.new_user_id, cud.user_type_code
        from   css.user_desc cud) cudsubq
    on gup.user_id = cudsubq.user_id
where  gup.user_id != cud.new_user_id or cud.new_user_id is null;

You mentioned you compare list of user in one database with a list of users in another. So you need to query data twice and you don't query the same data. Maybe you can use "minus" operator to avoid using "in"

select count(gup.user_id)
from   gsu.user_profile gup
  join (select cud.user_id from css.user_desc cud
        minus
        select cud.new_user_id from css.user_desc cud) cudsubq
    on gup.user_id = cudsubq.user_id;


回答4:

You want new_user_id's from table gup that don't match any new_user_id on table cud, right? It sounds like a job for a left join:

SELECT count(gup.user_id)
    FROM gsu.user_profile gup LEFT JOIN css.user_desc cud
        ON gup.user_id = cud.new_user_id
    WHERE cud.new_user_id is NULL

The join keeps all rows of gup, matching them with a new_user_id if possible. The WHERE condition keeps only the rows that have no matching row in cud.

(Apologies if you know this already and you're only interested in the behavior of the not in query)



标签: oracle notin