I have a SQL Server table of products, and each product has a description that is publicly available on our website. I want to prevent, or at least warn our users when, a description is too similar to another product's description. Each product's description length can greatly vary.
I'd like query for products with descriptions that include duplicate/similar paragraphs/blocks of text between one another. i.e. String A has a bunch of unique content, but shares a similar/identical paragraph w/ string B. However, I'm not sure which similarity algorithm is best to use:
The Levenshtein distance and Jaro-Winler distance algorithms appear to only work well with short strings.
I'm not sure the longest common subsequence algorithm takes into account large differences very well. i.e. it appears to ignore potential space between two characters, finding any similar combination sequence.
Fuzzy hashing sort of sounds what I'm looking for, but I'm not just looking for duplicate content w/ subtle differences. I'm also looking for duplicate content w/ subtle differences injected within a unique block of text. And I'd have no idea how to implement fuzzy hashes in SQL. SOUNDEX() and DIFFERENCE() appear to use fuzzy hashing, but are quite imprecise for my use case.
Ideally the similarity SQL function would be fast, but I could store cached similarity values in another table and schedule a job to occasionally update.
What is the best algorithm/SQL (or CLR integration) implementation to accomplish this?