I want to compare the values of two columns (diff table) having comma separated values of two different Oracle tables. I want to find rows that match with all values (NAME1
all values should match with NAME2
values).
Note: The comma separated values are in different order.
Example:
T1:
ID_T1 NAME1
===================================
1 ASCORBIC ACID, PARACETAMOL, POTASSIUM HYDROGEN CARBONATE
2 SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS, CITRIC ACID
3 CAFFEINE, PARACETAMOL PH. EUR.
4 PSEUDOEPHEDRINE HYDROCHLORIDE,DEXCHLORPHENIRAMINE MALEATE
T2:
ID_T2 NAME2
=================================
4 POTASSIUM HYDROGEN CARBONATE, ASCORBIC ACID, PARACETAMOL
5 SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS
6 PARACETAMOL PH. EUR.,CAFFEINE
7 CODEINE PHOSPHATE, PARACETAMOL DC
8 DEXCHLORPHENIRAMINE MALEATE, DEXTROMETHORPHAN HYDROBROMIDE
10 DEXCHLORPHENIRAMINE MALEATE, PSEUDOEPHEDRINE HYDROCHLORIDE
MY RESULT should only show the matching row based on ALL NAME Matches in both tables.
ID_T1 ID_T2 MATCHING NAME
==================================
1 4 POTASSIUM HYDROGEN CARBONATE, ASCORBIC ACID, PARACETAMOL
3 6 PARACETAMOL PH. EUR.,CAFFEINE
4 10 PSEUDOEPHEDRINE HYDROCHLORIDE,DEXCHLORPHENIRAMINE MALEATE
Tried with REGEXP_SUBST
but wasn't able to make it work.
I used the below code to parse the values:
SELECT REGEXP_SUBSTR (NAME1, '[^,]+', 1, ROWNUM)
FROM T1
CONNECT BY ROWNUM <= LENGTH (NAME1) -
LENGTH (REPLACE (NAME, ',')) + 1
You could get the table(s) into first normal form and then compare the compounds that are stored in each row. A starting point could be:
{1} Tokenize each row, and write the tokens into a new table. Give each token its original ID plus a 3-letter prefix, indicating which table the token came from. {2} Group the rows of the new ("normalized") table by ID, and perform a LISTAGG(). Perform a self join, and find matching "token groups".
{1} Tokenize, create table as select (CTAS)
The inspiration to tokenize without using CONNECT BY came from this SO answer.
The contents of the TOKENS table will look something like this:
{2} GROUP BY, LISTAGG, self join
When doing things this way, you can get the substances into (alphabetical) order, and you can also pick a "delimiter" that you like (we have used '+') here.
ALTERNATIVE
If all that is no use to you, or you think this is too complicated, then you could try using TRANSLATE(). In this case, I'd recommend stripping all spaces/blanks from your dataset (in a query - not altering the original data!) like so:
Query
Result
NOTE: I've added the following rows to your sample data:
I found that it is easy to use TRANSLATE() in a way that gives you "false positives" ie the substances with ids 110, 210, and 211 will appear to "match". (In other words: I don't think this is the right tool for this job.)
DBFIDDLE here
(follow the link to see the sample tables and queries).