This question is similar to How to emulate MySQLs utf8_general_ci collation in PHP string comparisons but I want the function for vb.net rather than PhP.
Recently I make a lot of supposedly unique key.
Some of the keys are equivalent under UTF8 unicode collation.
For example, look at these 2 key:
byers-street-bistro__38.15_-79.07 byers-street-bistro__38.15_-79.07
If I paste that into front page, and look at the source code you'll see
byers-street-bistro__38.15_-79.07
byers-street-bistro__38.15_-79.07
Note: In stack overflow they still look different.
I know it's not the same. I guess even in stack exchange it doesn't show. Say I have 1 million such records and I want to test whether 2 string WILL be declared the same by MySQL UTF8 collation. I want to know that before uploading. How do I do that.
So vb.net think that those are different keys. When we created mysql query and upload that to database, the database complain it's the same key. Just one complain and the upload of 1 million databases will be stuck.
We don't even know what the hell is ? Where can we look that up anyway?
Anyway, I want a function that when given 2 strings will tell me whether they will count as the same or not.
If possible we want a function that convert strings into their most "standard" form.
For example, seems to encode nothing and the function would recoqnize all those nothing character and eliminate that.
Is there such thing?
So far this is what I do. I need something more comprehensive.
Private Function StraightenQuotesReplacement() As Generic.Dictionary(Of String, String)
Static replacement As Generic.Dictionary(Of String, String)
If replacement Is Nothing Then
replacement = New Generic.Dictionary(Of String, String)
replacement.Add(ChrW(&H201C), """")
replacement.Add(ChrW(&H201D), """")
replacement.Add(ChrW(&H2018), "'")
replacement.Add(ChrW(&H2019), "'")
End If
Return replacement
End Function
<Extension()>
Public Function straightenQuotes(ByVal somestring As String) As String
For Each key In StraightenQuotesReplacement.Keys
somestring = somestring.Replace(key, StraightenQuotesReplacement.Item(key))
Next
Return somestring
End Function
<Extension()>
Public Function germanCharacter(ByVal s As String) As String
Dim t = s
t = t.Replace("ä", "ae")
t = t.Replace("ö", "oe")
t = t.Replace("ü", "ue")
t = t.Replace("Ä", "Ae")
t = t.Replace("Ö", "Oe")
t = t.Replace("Ü", "Ue")
t = t.Replace("ß", "ss")
Return t
End Function
<Extension()>
Public Function japaneseCharacter(ByVal s As String) As String
Dim t = s
t = t.Replace("ヶ", "ケ")
Return t
End Function
<Extension()>
Public Function greekCharacter(ByVal s As String) As String
Dim t = s
t = t.Replace("ς", "σ")
t = t.Replace("ι", "ί")
Return t
End Function
<Extension()>
Public Function franceCharacter(ByVal s As String) As String
Dim t = s
t = t.Replace("œ", "oe")
Return t
End Function
<Extension()>
Public Function RemoveDiacritics(ByVal s As String) As String
Dim normalizedString As String
Dim stringBuilder As New StringBuilder
normalizedString = s.Normalize(NormalizationForm.FormD)
Dim i As Integer
Dim c As Char
For i = 0 To normalizedString.Length - 1
c = normalizedString(i)
If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
stringBuilder.Append(c)
End If
Next
Return stringBuilder.ToString()
End Function
<Extension()>
Public Function badcharacters(ByVal s As String) As String
Dim t = s
t = t.Replace(ChrW(8206), "")
Return t
End Function
<Extension()>
Public Function sanitizeUTF8_Unicode(ByVal str As String) As String
Return str.ToLower.removeDoubleSpaces.SpacetoDash.EncodeUrlLimited.straightenQuotes.RemoveDiacritics.greekCharacter.germanCharacter
End Function
I used the VBA code below to investigate strange strings.
I copied the "byers-street" line to cell D18 of an Excel worksheet and typed
call DsplInHex(Range("D18"))
into the Immediate window. The result was:Adding a line break and some spaces gives:
According to my Unicode book
200E
is aLeft-To-Right Mark
. I would be interested to know how you managed to add that character to your key.VB.NET is correct; these keys are different. Either MySQL deletes such characters or your transfer process deleted it. Either way, you need check your source data for funny characters.
Probably using different unicode code points for characters that look similar, e.g. hyphen-minus (- U+002D), en-dash (– U+2013), and em-dash (— U+2014) are three different characters that all look similar: - – —
Use the AscW() function to check each character.
EDIT:
As discussed in the comments below, use the System.Text.NormalizationForm namespace to determine which Unicode code points are considered to be equivalent characters.