I scraped a seller website and now need to find same product variants from title. How can I find same words for variant title?
EXAMPLE IN THE BELOW:
I scraped a seller website and now need to find same product variants from title. How can I find same words for variant title?
EXAMPLE IN THE BELOW:
I would prefer doing this via VBA however it is still doable with formula.
Looking at your sample strings, you need to remove one, two or three words from right to extract the common string.
What you need is to remove the last one word from the end, and look for an exact match using * wildcards.
If removing one word is not enough to find a match, you should do this for two words, then three words and so on if you need more.
Below is the output of the formulas for stripping one, two, and three words respectively:
CELL B2: Formula to strip one word from right and look for a match
=IFNA(IFNA(IF(MATCH("*"&LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))))-1)&"*",A3:$A$14,0)>0,LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))))-1),""),IF(MATCH("*"&LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))))-1)&"*",$A$1:A1,0)>0,LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))))-1),"")),"STRIPPING ONE WORD NOT ENOUGH")
CELL C2: Formula to strip one word, ifna then two words from right and look for a match
=IFNA(IFNA(IF(MATCH("*"&LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))))-1)&"*",A3:$A$14,0)>0,LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))))-1),""),IF(MATCH("*"&LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))))-1)&"*",$A$1:A1,0)>0,LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))))-1),"")),IFNA(IFNA(IF(MATCH("*"&LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))-1))-1)&"*",A3:$A$14,0)>0,LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))-1))-1),""),IF(MATCH("*"&LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))-1))-1)&"*",$A$1:A1,0)>0,LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))-1))-1),"")),"STRIPPING TWO WORDS NOT ENOUGH"))
CELL D2: Formula to strip one word, ifna then two words, ifna then three words from right and look for a match
=IFNA(IFNA(IF(MATCH("*"&LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))))-1)&"*",A3:$A$14,0)>0,LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))))-1),""),IF(MATCH("*"&LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))))-1)&"*",$A$1:A1,0)>0,LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))))-1),"")),IFNA(IFNA(IF(MATCH("*"&LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))-1))-1)&"*",A3:$A$14,0)>0,LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))-1))-1),""),IF(MATCH("*"&LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))-1))-1)&"*",$A$1:A1,0)>0,LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))-1))-1),"")),IFNA(IFNA(IF(MATCH("*"&LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))-2))-1)&"*",A3:$A$14,0)>0,LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))-2))-1),""),IF(MATCH("*"&LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))-2))-1)&"*",$A$1:A1,0)>0,LEFT(A2,FIND("[",SUBSTITUTE(A2," ","[",LEN(A2)-LEN(SUBSTITUTE(A2," ",""))-2))-1),"")),"STRIPPING THREE WORDS NOT ENOUGH")))
Looking at your sample strings in your question, you should use the formula in Cell D2 since you have three or less words to be stripped. If you have four or more, follow the logic and replace "STRIPPING THREE WORDS NOT ENOUGH" with the formula for one word (and repeat this for more words). Just copy/paste the formulas for the rest of your list.
A Reminder: My list ends at A13 so I used A14 in my formula. You should change it according to the length of your list.
Another Reminder: These formulas are based on your sample strings so if you encounter any problems with this formula in your original list, then you should also include these sample strings to your question as well for a feasible solution.
You can use VLOOKUP. According to this tutorial (https://www.excel-university.com/perform-approximate-match-and-fuzzy-lookup-in-excel/),
=VLOOKUP(C7, Table1, 2, FALSE)
Where:
I would consider using Fuzzy match with Microsoft add-in if you have a Microsoft Windows operating system with Microsoft Excel 2010 or newer installed.
Download: https://www.microsoft.com/en-gb/download/details.aspx?id=15011
It is easy to use and the key advantage is that you can easily extend it when you encounter new terms. You can also manually inspect matches in a simple fashion and tweak match parameters to get better results.
You can create mapping tables to achieve your goal. Consider a simplified example as having a reference table of expected expressions which new values are compared against. You are given a score of how likely it is they are match/are the same.
From the MS Fuzzy read me:
A challenging problem in data management is that the same entity may be represented in multiple ways throughout the dataset. For instance, customer “Andy Hill” might also be present as “Mr. Andrew Hill” or “Hill, Andrew R.”. Variations can result from merging independent data sources, spelling mistakes, inconsistent naming conventions and abbreviations, or records with additional/missing information.
Fuzzy Lookup technology, developed by Microsoft Research, allows you to quickly identify data records which are textually similar. You can identify fuzzy duplicates within a single table or perform a fuzzy join between two different tables.
Additional:
1) Tutorials for Add-in
2) Geek out: If you are interested in a technical application of fuzzy matching see this amazing answer by @Alain and surrounding discussion:
Getting the closest string match
3) Technical resources on Fuzzy Lookup