Is there a function in vb.net that will tell us wh

2019-08-10 06:34发布

问题:

This question is similar to How to emulate MySQLs utf8_general_ci collation in PHP string comparisons but I want the function for vb.net rather than PhP.

Recently I make a lot of supposedly unique key.

Some of the keys are equivalent under UTF8 unicode collation.

For example, look at these 2 key:

byers-street-bistro__38.15_-79.07 byers-street-bistro‎__38.15_-79.07

If I paste that into front page, and look at the source code you'll see

byers-street-bistro__38.15_-79.07

byers-street-bistro‎__38.15_-79.07

Note: In stack overflow they still look different.

I know it's not the same. I guess even in stack exchange it doesn't show. Say I have 1 million such records and I want to test whether 2 string WILL be declared the same by MySQL UTF8 collation. I want to know that before uploading. How do I do that.

So vb.net think that those are different keys. When we created mysql query and upload that to database, the database complain it's the same key. Just one complain and the upload of 1 million databases will be stuck.

We don't even know what the hell is ‎? Where can we look that up anyway?

Anyway, I want a function that when given 2 strings will tell me whether they will count as the same or not.

If possible we want a function that convert strings into their most "standard" form.

For example, ‎ seems to encode nothing and the function would recoqnize all those nothing character and eliminate that.

Is there such thing?

So far this is what I do. I need something more comprehensive.

    Private Function StraightenQuotesReplacement() As Generic.Dictionary(Of String, String)
    Static replacement As Generic.Dictionary(Of String, String)
    If replacement Is Nothing Then
        replacement = New Generic.Dictionary(Of String, String)
        replacement.Add(ChrW(&H201C), """")
        replacement.Add(ChrW(&H201D), """")
        replacement.Add(ChrW(&H2018), "'")
        replacement.Add(ChrW(&H2019), "'")
    End If
    Return replacement
End Function

<Extension()>
Public Function straightenQuotes(ByVal somestring As String) As String
    For Each key In StraightenQuotesReplacement.Keys
        somestring = somestring.Replace(key, StraightenQuotesReplacement.Item(key))
    Next
    Return somestring
End Function

<Extension()>
Public Function germanCharacter(ByVal s As String) As String
    Dim t = s
    t = t.Replace("ä", "ae")
    t = t.Replace("ö", "oe")
    t = t.Replace("ü", "ue")
    t = t.Replace("Ä", "Ae")
    t = t.Replace("Ö", "Oe")
    t = t.Replace("Ü", "Ue")
    t = t.Replace("ß", "ss")
    Return t
End Function
<Extension()>
Public Function japaneseCharacter(ByVal s As String) As String
    Dim t = s
    t = t.Replace("ヶ", "ケ")
    Return t
End Function

<Extension()>
Public Function greekCharacter(ByVal s As String) As String
    Dim t = s
    t = t.Replace("ς", "σ")
    t = t.Replace("ι", "ί")

    Return t
End Function
<Extension()>
Public Function franceCharacter(ByVal s As String) As String
    Dim t = s
    t = t.Replace("œ", "oe")
    Return t
End Function

<Extension()>
Public Function RemoveDiacritics(ByVal s As String) As String
    Dim normalizedString As String
    Dim stringBuilder As New StringBuilder
    normalizedString = s.Normalize(NormalizationForm.FormD)
    Dim i As Integer
    Dim c As Char
    For i = 0 To normalizedString.Length - 1
        c = normalizedString(i)
        If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
            stringBuilder.Append(c)
        End If
    Next
    Return stringBuilder.ToString()
End Function

<Extension()>
Public Function badcharacters(ByVal s As String) As String
    Dim t = s
    t = t.Replace(ChrW(8206), "")
    Return t
End Function

<Extension()>
Public Function sanitizeUTF8_Unicode(ByVal str As String) As String
    Return str.ToLower.removeDoubleSpaces.SpacetoDash.EncodeUrlLimited.straightenQuotes.RemoveDiacritics.greekCharacter.germanCharacter
End Function

回答1:

Probably using different unicode code points for characters that look similar, e.g. hyphen-minus (- U+002D), en-dash (– U+2013), and em-dash (— U+2014) are three different characters that all look similar: - – —

Use the AscW() function to check each character.

EDIT:

As discussed in the comments below, use the System.Text.NormalizationForm namespace to determine which Unicode code points are considered to be equivalent characters.



回答2:

I used the VBA code below to investigate strange strings.

I copied the "byers-street" line to cell D18 of an Excel worksheet and typed call DsplInHex(Range("D18")) into the Immediate window. The result was:

62 79 65 72 73 2D 73 74 72 65 65 74 2D 62 69 73 74 72 6F 5F 33 38 2E 31 35 2D 37 39 2E 30 37 20 62 79 65 72 73 2D 73 74 72 65 65 74 2D 62 69 73 74 72 6F 200E 5F 33 38 2E 31 35 2D 37 39 2E 30 37 

Adding a line break and some spaces gives:

62 79 65 72 73 2D 73 74 72 65 65 74 2D 62 69 73 74 72 6F      5F 33 38 2E 31 35 2D 37 39 2E 30 37 20 
62 79 65 72 73 2D 73 74 72 65 65 74 2D 62 69 73 74 72 6F 200E 5F 33 38 2E 31 35 2D 37 39 2E 30 37 

According to my Unicode book 200E is a Left-To-Right Mark. I would be interested to know how you managed to add that character to your key.

VB.NET is correct; these keys are different. Either MySQL deletes such characters or your transfer process deleted it. Either way, you need check your source data for funny characters.

Option Explicit
Public Sub DsplInHex(Stg As String)

  Dim Pos As Long

  For Pos = 1 To Len(Stg)
    Debug.Print Hex(AscW(Mid(Stg, Pos, 1))) & " ";
  Next
  Debug.Print

End Sub