Get percentage of matching strings

2019-08-19 09:20发布

问题:

I have two string to match and get the percentage of matching.

Given:

String 1: John Smith Makde

String 2: Makde John Smith   

Used the following user defined scalar function.

CREATE FUNCTION [dbo].[udf_GetPercentageOfTwoStringMatching]
(
    @string1 NVARCHAR(1000)
    ,@string2 NVARCHAR(1000)
)
RETURNS INT

--WITH ENCRYPTION 
AS
BEGIN

    DECLARE @levenShteinNumber INT

    DECLARE @string1Length INT = LEN(@string1), @string2Length INT = LEN(@string2)
    DECLARE @maxLengthNumber INT = CASE WHEN @string1Length > @string2Length THEN @string1Length ELSE @string2Length END

    SELECT @levenShteinNumber = [dbo].[f_ALGORITHM_LEVENSHTEIN] (@string1  ,@string2)

    DECLARE @percentageOfBadCharacters INT = @levenShteinNumber * 100 / @maxLengthNumber

    DECLARE @percentageOfGoodCharacters INT = 100 - @percentageOfBadCharacters

    -- Return the result of the function
    RETURN @percentageOfGoodCharacters

END

Function calling:

SELECT dbo.f_GetPercentageOfTwoStringMatching('John Smith Makde','Makde John Smith')    

Output:

7

But when I give both the string as same with same position:

SELECT dbo.f_GetPercentageOfTwoStringMatching('John Smith Makde','John Smith Makde')

Output:

100

Expected Result: As the both strings words are same but with different sequence I want 100% matching percentage.

100

回答1:

Doing this for millions of rows again and again will be a nightmare... I'd add another column (or a 1:1 related side table) to permantently store a normalized string. Try this:

--Create a mockup table and fill it with some dummy data

CREATE TABLE #MockUpYourTable(ID INT IDENTITY, SomeName VARCHAR(1000));
INSERT INTO #MockUpYourTable VALUES('Makde John Smith')
                                  ,('Smith John Makde')
                                  ,('Some other string')
                                  ,('string with with duplicates with');
GO

--Add a column to store the normalized strings

ALTER TABLE #MockupYourTable ADD NormalizedName VARCHAR(1000);
GO

--Use this script to split your string in fragments and re-concatenate them as canonically ordered, duplicate-free string.

UPDATE #MockUpYourTable SET NormalizedName=CAST('<x>' + REPLACE((SELECT LOWER(SomeName) AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML)
                                            .query(N'
                                                    for $fragment in distinct-values(/x/text())
                                                    order by $fragment
                                                    return $fragment
                                                    ').value('.','nvarchar(1000)');           
GO

--Check the result

SELECT * FROM #MockUpYourTable

ID  SomeName                            NormalizedName
----------------------------------------------------------
1   Makde John Smith                    john makde smith
2   Smith John Makde                    john makde smith
3   Some other string                   other some string
4   string with with duplicates with    duplicates string with

--Clean-Up
GO
DROP TABLE #MockUpYourTable

Hint Use a trigger ON INSERT, UPDATE to keep these values synced.

Now you can use the same transformation against your strings you want this to compare with and use your former approach. Due to the re-sorting, identical fragments will return 100% similarity.



回答2:

+1 for the question. It appears you are trying to determine how similar two names are. It's hard to determine how you are doing that. I'm very familiar with the Levenshtein Distance for example but don't understand how you are trying to use it. To get you started I put together two ways you might approach this. This won't be a complete answer but rather the tools you will need to do whatever you're trying.

To compare the number of matching "name parts" you could use DelimitedSplit8K like this:

DECLARE 
  @String1 VARCHAR(100) = 'John Smith Makde Sr.',
  @String2 VARCHAR(100) = 'Makde John Smith Jr.';

SELECT COUNT(*)/(1.*LEN(@String1)-LEN(REPLACE(@string1,' ',''))+1)
FROM
(
  SELECT      s1.item
  FROM        dbo.delimitedSplit8K(@String1,' ') AS s1
  INTERSECT
  SELECT      s2.item
  FROM        dbo.delimitedSplit8K(@String2,' ') AS s2
) AS a

Here Im splitting the names into atomic values and counting which ones match. Then we divide that number by the number of values. 3/4 = .75 for 75%; 3 of the four names match.

Another method would be to use NGrams8K like so:

DECLARE 
  @String1 VARCHAR(100) = 'John Smith Makde Sr.',
  @String2 VARCHAR(100) = 'Makde John Smith Jr.';

SELECT (1.*f.L-f.MM)/f.L
FROM
(
  SELECT 
    MM = SUM(ABS(s1.C-s2.C)), 
    L  = CASE WHEN LEN(@String1)>LEN(@string2) THEN LEN(@String1) ELSE LEN(@string2) END
  FROM
  (
    SELECT s1.token, COUNT(*)
    FROM   samd.NGrams8k(@String1,1) AS s1
    GROUP BY s1.token
  ) AS s1(T,C)
  JOIN 
  (
    SELECT s1.token, COUNT(*)
    FROM   samd.NGrams8k(@String2,1) AS s1
    GROUP BY s1.token
  ) AS s2(T,C)
  ON  s1.T=s2.T  -- Letters that are equal
  AND s1.C<>s2.C -- ... but the QTY is different
) AS f;

Here we're counting the characters and substracting the mismatches. There are two (one extra J and one extra S). The longer of the two strings is 20, there are 18 characters where the letter and qty are equal. 18/20 = .9 OR 90%.

Again, what you are doing is not complicated, I would just need more detail for a better answer.