Length of substring matched by culture-sensitive S

2019-02-21 12:45发布

问题:

I tried writing a culture-aware string replacement method:

public static string Replace(string text, string oldValue, string newValue)
{
    int index = text.IndexOf(oldValue, StringComparison.CurrentCulture);
    return index >= 0
        ? text.Substring(0, index) + newValue + text.Substring(index + oldValue.Length)
        : text;
}

However, it chokes on Unicode combining characters:

// \u0301 is Combining Acute Accent
Console.WriteLine(Replace("déf", "é", "o"));       // 1. CORRECT: dof
Console.WriteLine(Replace("déf", "e\u0301", "o")); // 2. INCORRECT: do
Console.WriteLine(Replace("de\u0301f", "é", "o")); // 3. INCORRECT: dóf

To fix my code, I need to know that in the second example, String.IndexOf matched only one character (é) even though it searched for two (e\u0301). Similarly, I need to know that in the third example, String.IndexOf matched two characters (e\u0301) even though it only searched for one (é).

How can I determine the actual length of the substring matched by String.IndexOf?

NOTE: Performing Unicode normalization on text and oldValue (as suggested by James Keesey) would accommodate combining characters, but ligatures would still be a problem:

Console.WriteLine(Replace("œf", "œ", "i"));  // 4. CORRECT: if
Console.WriteLine(Replace("œf", "oe", "i")); // 5. INCORRECT: i
Console.WriteLine(Replace("oef", "œ", "i")); // 6. INCORRECT: ief

回答1:

You will need to directly call FindNLSString or FindNLSStringEx yourself. String.IndexOf uses FindNLSStringEx but all the information you need is available in FindNLSString.

Here is an example of how to rewrite your Replace method that works against your test cases. Note that I am using the current user locale read up the API documentation if you want to use the system locale or provide your own. I am also passing in 0 for the flags which means it will use the default string comparison options for the locale, again the documentation can help you provide different options.

public const int LOCALE_USER_DEFAULT = 0x0400;

[DllImport("kernel32.dll", SetLastError = true, ExactSpelling = true)]
internal static extern int FindNLSString(int locale, uint flags, [MarshalAs(UnmanagedType.LPWStr)] string sourceString, int sourceCount, [MarshalAs(UnmanagedType.LPWStr)] string findString, int findCount, out int found);

public static string ReplaceWithCombiningCharSupport(string text, string oldValue, string newValue)
{
    int foundLength;
    int index = FindNLSString(LOCALE_USER_DEFAULT, 0, text, text.Length, oldValue, oldValue.Length, out foundLength);
    return index >= 0 ? text.Substring(0, index) + newValue + text.Substring(index + foundLength) : text;
}


回答2:

I spoke too soon (and had never seen this method before) but there is an alternative. You can use the StringInfo.ParseCombiningCharacters() method to get the start of each actual character and use that to determine the length of the string to replace.


You will need to normalize both strings before you do the Index call. This will make sure that the source and target strings are the same length.

See the String.Normalize() reference page which describes this exact problem.



回答3:

Using the following methods works for your examples. It works by comparing values until it finds how many characters are needed in the source string to equal the oldValue, and using that instead of simply oldValue.Length.

public static string Replace(string text, string oldValue, string newValue)
{
    int index = text.IndexOf(oldValue, StringComparison.CurrentCulture);
    if (index >= 0)
        return text.Substring(0, index) + newValue +
                 text.Substring(index + LengthInString(text, oldValue, index));
    else
        return text;
}
static int LengthInString(string text, string oldValue, int index)
{
    for (int length = 1; length <= text.Length - index; length++)
        if (string.Equals(text.Substring(index, length), oldValue,
                                            StringComparison.CurrentCulture))
            return length;
    throw new Exception("Oops!");
}