Query
SELECT COUNT(*), name, number
FROM tbl
GROUP BY name, number
HAVING COUNT(*) > 1
It sometimes fails to find duplicates between lower case and upper case.
E.g.: sunny
and Sunny
don't show up as a duplicates.
So how to find all possible duplicates in PostgreSQL for two columns.
lower()
/upper()
Use one of these to fold characters to either lower or upper case. Special characters are not affected:
unaccent()
If you actually want to ignore diacritic signs, like your comments imply, install the additional module
unaccent
, which provides a text search dictionary that removes accents and also the general purpose functionunaccent()
:Makes it very simple:
Result:
This doesn't strip non-letters. Add
regexp_replace()
like @Craig mentioned for that:Result:
You can even build a functional index on top of that:
PostgreSQL by default is case sensitive. You can force it to be case-insensitive during searches by converting all values to a single case:
(Updated answer after clarification from poster): The idea of "unaccenting" or stripping accents (dicratics) is generally bogus. It's OK-ish if you're matching data to find out if some misguided user or application munged
résumé
intoresume
, but it's totally wrong to change one into the other, as they're different words. Even then it'll only kind-of work, and should be combined with a string-similarity matching system like trigrams or Levenshtein distances.The idea of "unaccenting" presumes that any accented character has a single valid equivalent unaccented character, or at least that any given accented character is replaced with at most one unaccented character in an ascii-ized representation of the word. That simply isn't true; in one language
ö
might be a "u" sound, while in another it might be a long "oo", and the "ascii-ized" spelling conventions might reflect that. Thus, in language the correct "un-accenting" of the made-up dummy-word "Tapö" might be "Tapu" and in another this imaginary word might be ascii-ized to "Tapoo". In neither case will the "un-accented" form of "Tapo" match what people actually write when forced into the ascii character set. Words with dicratics may also be ascii-ized into a hyphenated word.You can see this in English with ligatures, where the word
dæmon
is ascii-izeddaemon
. If you stripped the ligature you'd getdmon
which wouldn't matchdaemon
, the common spelling. The same is true ofæther
which is typically ascii-ized toaether
orether
. You can also see this in German with ß, typically "expanded" asss
.If you must attempt to "un-accent", "normalize" accents or "strip" accents:
You can use a character class regular expression to strip out all but a specified set of characters. In this case we use the
\W
escape (shorthand for the character class[^[:alnum:]_]
as per the manual) to exclude "symbols" but not accented characters:If you want to filter out accented chars too you can define your own character class:
If you actually intended to substitute some accented characters for similar unaccented characters, you could use
translate
as per this wiki article: