I have two string to match and get the percentage of matching.
Given:
String 1: John Smith Makde
String 2: Makde John Smith
Used the following user defined scalar function.
CREATE FUNCTION [dbo].[udf_GetPercentageOfTwoStringMatching]
(
@string1 NVARCHAR(1000)
,@string2 NVARCHAR(1000)
)
RETURNS INT
--WITH ENCRYPTION
AS
BEGIN
DECLARE @levenShteinNumber INT
DECLARE @string1Length INT = LEN(@string1), @string2Length INT = LEN(@string2)
DECLARE @maxLengthNumber INT = CASE WHEN @string1Length > @string2Length THEN @string1Length ELSE @string2Length END
SELECT @levenShteinNumber = [dbo].[f_ALGORITHM_LEVENSHTEIN] (@string1 ,@string2)
DECLARE @percentageOfBadCharacters INT = @levenShteinNumber * 100 / @maxLengthNumber
DECLARE @percentageOfGoodCharacters INT = 100 - @percentageOfBadCharacters
-- Return the result of the function
RETURN @percentageOfGoodCharacters
END
Function calling:
SELECT dbo.f_GetPercentageOfTwoStringMatching('John Smith Makde','Makde John Smith')
Output:
7
But when I give both the string as same with same position:
SELECT dbo.f_GetPercentageOfTwoStringMatching('John Smith Makde','John Smith Makde')
Output:
100
Expected Result: As the both strings words are same but with different sequence I want 100% matching percentage.
100
Doing this for millions of rows again and again will be a nightmare... I'd add another column (or a 1:1
related side table) to permantently store a normalized string. Try this:
--Create a mockup table and fill it with some dummy data
CREATE TABLE #MockUpYourTable(ID INT IDENTITY, SomeName VARCHAR(1000));
INSERT INTO #MockUpYourTable VALUES('Makde John Smith')
,('Smith John Makde')
,('Some other string')
,('string with with duplicates with');
GO
--Add a column to store the normalized strings
ALTER TABLE #MockupYourTable ADD NormalizedName VARCHAR(1000);
GO
--Use this script to split your string in fragments and re-concatenate them as canonically ordered, duplicate-free string.
UPDATE #MockUpYourTable SET NormalizedName=CAST('<x>' + REPLACE((SELECT LOWER(SomeName) AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML)
.query(N'
for $fragment in distinct-values(/x/text())
order by $fragment
return $fragment
').value('.','nvarchar(1000)');
GO
--Check the result
SELECT * FROM #MockUpYourTable
ID SomeName NormalizedName
----------------------------------------------------------
1 Makde John Smith john makde smith
2 Smith John Makde john makde smith
3 Some other string other some string
4 string with with duplicates with duplicates string with
--Clean-Up
GO
DROP TABLE #MockUpYourTable
Hint Use a trigger ON INSERT, UPDATE
to keep these values synced.
Now you can use the same transformation against your strings you want this to compare with and use your former approach. Due to the re-sorting, identical fragments will return 100% similarity.
+1 for the question. It appears you are trying to determine how similar two names are. It's hard to determine how you are doing that. I'm very familiar with the Levenshtein Distance for example but don't understand how you are trying to use it. To get you started I put together two ways you might approach this. This won't be a complete answer but rather the tools you will need to do whatever you're trying.
To compare the number of matching "name parts" you could use DelimitedSplit8K like this:
DECLARE
@String1 VARCHAR(100) = 'John Smith Makde Sr.',
@String2 VARCHAR(100) = 'Makde John Smith Jr.';
SELECT COUNT(*)/(1.*LEN(@String1)-LEN(REPLACE(@string1,' ',''))+1)
FROM
(
SELECT s1.item
FROM dbo.delimitedSplit8K(@String1,' ') AS s1
INTERSECT
SELECT s2.item
FROM dbo.delimitedSplit8K(@String2,' ') AS s2
) AS a
Here Im splitting the names into atomic values and counting which ones match. Then we divide that number by the number of values. 3/4 = .75 for 75%; 3 of the four names match.
Another method would be to use NGrams8K like so:
DECLARE
@String1 VARCHAR(100) = 'John Smith Makde Sr.',
@String2 VARCHAR(100) = 'Makde John Smith Jr.';
SELECT (1.*f.L-f.MM)/f.L
FROM
(
SELECT
MM = SUM(ABS(s1.C-s2.C)),
L = CASE WHEN LEN(@String1)>LEN(@string2) THEN LEN(@String1) ELSE LEN(@string2) END
FROM
(
SELECT s1.token, COUNT(*)
FROM samd.NGrams8k(@String1,1) AS s1
GROUP BY s1.token
) AS s1(T,C)
JOIN
(
SELECT s1.token, COUNT(*)
FROM samd.NGrams8k(@String2,1) AS s1
GROUP BY s1.token
) AS s2(T,C)
ON s1.T=s2.T -- Letters that are equal
AND s1.C<>s2.C -- ... but the QTY is different
) AS f;
Here we're counting the characters and substracting the mismatches. There are two (one extra J and one extra S). The longer of the two strings is 20, there are 18 characters where the letter and qty are equal. 18/20 = .9 OR 90%.
Again, what you are doing is not complicated, I would just need more detail for a better answer.