I have a query requirement from ----. Trying to solve it with CONNECT BY
, but can't seem to get the results I need.
Table (simplified):
create table CSS.USER_DESC (
USER_ID VARCHAR2(30) not null,
NEW_USER_ID VARCHAR2(30),
GLOBAL_HR_ID CHAR(8)
)
-- USER_ID is the primary key
-- NEW_USER_ID is a self-referencing key
-- GLOBAL_HR_ID is an ID field from another system
There are two sources of user data (datafeeds)... I have to watch for mistakes in either of them when updating information.
Scenarios:
- A user is given a new User ID... The old record is set accordingly and deactivated (typically a rename for contractors who become fulltime)
- A user leaves and returns sometime later. HR fails to send us the old user ID so we can connect the accounts.
- The system screwed up and didn't set the new User ID on the old record.
- The data can be bad in a hundred other ways
I need to know the following are the same user, and I can't rely on name or other fields... they differ among matching records:
ROOTUSER NUMROOTS NODELEVEL ISLEAF USER_ID NEW_USER_ID GLOBAL_HR_ID USERTYPE LAST_NAME FIRST_NAME
-----------------------------------------------------------------------------------------------------------------------------
EX0T1100 2 1 0 EX0T1100 EX000005 CONTRACTOR VON DER HAAVEN VERONICA
EX0T1100 2 2 1 EX000005 00126121 EMPLOYEE HAAVEN, VON DER VERONICA
GL110456 1 1 1 GL110456 00126121 EMPLOYEE VONDERHAAVEN VERONICA
EXOT1100
and EX000005
are connected properly by the NEW_USER_ID
field. The rename occurred before there were global HR IDs, so EX0T1100
doesn't have one. EX000005
was given a new user ID, 'GL110456', and the two are only connected by having the same global HR ID.
Cleaning up the data isn't an option.
The query so far:
select connect_by_root cud.user_id RootUser,
count(connect_by_root cud.user_id) over (partition by connect_by_root cud.user_id) NumRoots,
level NodeLevel, connect_by_isleaf IsLeaf, --connect_by_iscycle IsCycle,
cud.user_id, cud.new_user_id, cud.global_hr_id,
cud.user_type_code UserType, ccud.last_name, cud.first_name
from css.user_desc cud
where cud.user_id in ('EX000005','EX0T1100','GL110456')
-- Using this so I don't get sub-users in my list of root users...
-- It complicates the matches with GLOBAL_HR_ID, however
start with cud.user_id not in (select cudsub.new_user_id
from css.user_desc cudsub
where cudsub.new_user_id is not null)
connect by nocycle (prior new_user_id = user_id);
I've tried various CONNECT BY
clauses, but none of them are quite right:
-- As a multiple CONNECT BY
connect by nocycle (prior global_hr_id = global_hr_id)
connect by nocycle (prior new_user_id = user_id)
-- As a compound CONNECT BY
connect by nocycle ((prior new_user_id = user_id)
or (prior global_hr_id = global_hr_id
and user_id != prior user_Id))
UNIONing two CONNECT BY queries doesn't work... I don't get the leveling.
Here is what I would like to see... I'm okay with a resultset that I have to distinct and use as a subquery. I'm also okay with any of the three user IDs in the ROOTUSER column... I just need to know they're the same users.
ROOTUSER NUMROOTS NODELEVEL ISLEAF USER_ID NEW_USER_ID GLOBAL_HR_ID USERTYPE LAST_NAME FIRST_NAME
-----------------------------------------------------------------------------------------------------------------------------
EX0T1100 3 1 0 EX0T1100 EX000005 CONTRACTOR VON DER HAAVEN VERONICA
EX0T1100 3 2 1 EX000005 00126121 EMPLOYEE HAAVEN, VON DER VERONICA
EX0T1100 3 (2 or 3) 1 GL110456 00126121 EMPLOYEE VONDERHAAVEN VERONICA
Ideas?
Update
Nicholas, your code looks very much like the right track... at the moment, the lead(user_id) over (partition by global_hr_id)
gets false hits when the global_hr_id
is null. For example:
USER_ID NEW_USER_ID CHAINNEWUSER GLOBAL_HR_ID LAST_NAME FIRST_NAME
FP004468 FP004469 AARON TIMOTHY
FP004469 FOONG KOK WAH
I've often wanted to treat nulls as separate records in a partition, but I've never found a way to make ignore nulls
work. This did what I wanted:
decode(global_hr_id,null,null,lead(cud.user_id ignore nulls) over (partition by global_hr_id order by user_id)
... but there's got to be a better way. I haven't been able to get the query to finish yet on the full-blown user data (about 40,000 users). Both global_hr_id
and new_user_id
are indexed.
Update
The query returns after about 750 seconds... long, but manageable. It returns 93k records, because I don't have a good way of filtering level 2 hits out of the root - you have start with global_hr_id is null
, but unfortunately, that isn't always the case. I'll have to think some more about how to filter those out.
I've tried adding more complex start with clauses before, but I find that separately, they run < 1 second... together, they take 90 minutes >.<
Thanks again for you help... plodding away at this.
You have provided sample of data for only one user. Would be better to have a little bit more. Anyway, lets look at something like this.
Result: