C# How to replace Microsoft's Smart Quotes wit

Posted 2019-01-16 18:50

My earlier post (linked below) asked what curly quotation marks are and why my app wouldn't work with them. My question now is: how can I replace them in C# when my program comes across them? Are they special characters?

curly-quotation-marks-vs-square-quotation-marks-what-gives

Thanks

11 answers
女痞
#2 · 2019-01-16 19:18

I have a whole great big... program... that does precisely this. You can rip out the script and use it at your leisure. It does all sorts of replacements and is located at http://bitbucket.org/nesteruk/typografix

Evening l夕情丶
#3 · 2019-01-16 19:19

Note that what you have is inherently a corrupt CSV file. Indiscriminately replacing all typographer's quotes with straight quotes won't necessarily fix your file. For all you know, some of the typographer's quotes were supposed to be there, as part of a field's value. Replacing them with straight quotes might not leave you with a valid CSV file, either.

I don't think there is an algorithmic way to fix a file that is corrupt in the way you describe. Your time might be better spent investigating how you come to have such invalid files in the first place, and then putting a stop to it. Is someone using Word to edit your data files, for instance?
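Before deciding whether any replacement is safe, it can help to see exactly where the typographer's quotes occur in the file. The following is only a rough sketch of such a check: the class name `SmartQuoteScanner`, the method `Find`, and the choice of four suspect characters are assumptions made for illustration, not part of the original question.

```csharp
using System.Collections.Generic;

public static class SmartQuoteScanner
{
    // Characters treated as typographer's quotes in this sketch (an assumption; extend as needed).
    private const string Suspects = "\u2018\u2019\u201C\u201D";

    // Returns the 1-based line and column of every suspect character, plus the character itself.
    public static List<(int Line, int Column, char Ch)> Find(IEnumerable<string> lines)
    {
        var hits = new List<(int Line, int Column, char Ch)>();
        int lineNo = 0;
        foreach (string line in lines)
        {
            lineNo++;
            for (int col = 0; col < line.Length; col++)
            {
                // string.IndexOf(char) performs an ordinal comparison.
                if (Suspects.IndexOf(line[col]) >= 0)
                    hits.Add((lineNo, col + 1, line[col]));
            }
        }
        return hits;
    }
}
```

Passing it the lines of a file, for example `SmartQuoteScanner.Find(File.ReadLines(path))` with a `path` of your own, gives a list you can review by hand to judge whether a quote sits at a field boundary or inside a field's value.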

SAY GOODBYE
#4 · 2019-01-16 19:23

To extend on Nick van Esch's popular answer, here is the code with the names of the characters in the comments.

if (buffer.IndexOf('\u2013') > -1) buffer = buffer.Replace('\u2013', '-'); // en dash
if (buffer.IndexOf('\u2014') > -1) buffer = buffer.Replace('\u2014', '-'); // em dash
if (buffer.IndexOf('\u2015') > -1) buffer = buffer.Replace('\u2015', '-'); // horizontal bar
if (buffer.IndexOf('\u2017') > -1) buffer = buffer.Replace('\u2017', '_'); // double low line
if (buffer.IndexOf('\u2018') > -1) buffer = buffer.Replace('\u2018', '\''); // left single quotation mark
if (buffer.IndexOf('\u2019') > -1) buffer = buffer.Replace('\u2019', '\''); // right single quotation mark
if (buffer.IndexOf('\u201a') > -1) buffer = buffer.Replace('\u201a', ','); // single low-9 quotation mark
if (buffer.IndexOf('\u201b') > -1) buffer = buffer.Replace('\u201b', '\''); // single high-reversed-9 quotation mark
if (buffer.IndexOf('\u201c') > -1) buffer = buffer.Replace('\u201c', '\"'); // left double quotation mark
if (buffer.IndexOf('\u201d') > -1) buffer = buffer.Replace('\u201d', '\"'); // right double quotation mark
if (buffer.IndexOf('\u201e') > -1) buffer = buffer.Replace('\u201e', '\"'); // double low-9 quotation mark
if (buffer.IndexOf('\u2026') > -1) buffer = buffer.Replace("\u2026", "..."); // horizontal ellipsis
if (buffer.IndexOf('\u2032') > -1) buffer = buffer.Replace('\u2032', '\''); // prime
if (buffer.IndexOf('\u2033') > -1) buffer = buffer.Replace('\u2033', '\"'); // double prime
我命由我不由天
#5 · 2019-01-16 19:24

When I encountered this problem I wrote an extension method to the String class in C#.

public static class StringExtensions
{
    public static string StripIncompatableQuotes(this string s)
    {
        if (!string.IsNullOrEmpty(s))
            return s.Replace('\u2018', '\'').Replace('\u2019', '\'').Replace('\u201c', '\"').Replace('\u201d', '\"');
        else
            return s;
    }
}

This simply replaces the silly 'smart quotes' with normal quotes.

[EDIT] Fixed to also support replacement of 'double smart quotes'.

可以哭但决不认输i
#6 · 2019-01-16 19:24

Using Nick and Barbara's answers, here is example code with performance stats for 1,000,000 loops on my machine:

input = "shmB6BhLe0gdGU8OxYykZ21vuxLjBo5I1ZTJjxWfyRTTlqQlgz0yUtPu8iNCCcsx78EPsObiPkCpRT8nqRtvM3Bku1f9nStmigaw";
// Strings are immutable: Replace returns a new string, so the result must be assigned back.
input = input.Replace('\u2013', '-'); // en dash
input = input.Replace('\u2014', '-'); // em dash
input = input.Replace('\u2015', '-'); // horizontal bar
input = input.Replace('\u2017', '_'); // double low line
input = input.Replace('\u2018', '\''); // left single quotation mark
input = input.Replace('\u2019', '\''); // right single quotation mark
input = input.Replace('\u201a', ','); // single low-9 quotation mark
input = input.Replace('\u201b', '\''); // single high-reversed-9 quotation mark
input = input.Replace('\u201c', '\"'); // left double quotation mark
input = input.Replace('\u201d', '\"'); // right double quotation mark
input = input.Replace('\u201e', '\"'); // double low-9 quotation mark
input = input.Replace("\u2026", "..."); // horizontal ellipsis
input = input.Replace('\u2032', '\''); // prime
input = input.Replace('\u2033', '\"'); // double prime

Time: 958.1011 milliseconds
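The answer does not show how the timings were taken. One plausible way to collect figures of this kind is a `Stopwatch` loop like the sketch below; the class name `ReplaceBenchmark` and the method `Measure` are hypothetical, and the numbers it produces will differ from machine to machine.

```csharp
using System;
using System.Diagnostics;

public static class ReplaceBenchmark
{
    // Runs the given transformation on the same input `iterations` times
    // and returns the total elapsed wall-clock time.
    public static TimeSpan Measure(Func<string, string> transform, string input, int iterations)
    {
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            transform(input);
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

For example, `ReplaceBenchmark.Measure(s => s.Replace('\u2018', '\''), input, 1000000).TotalMilliseconds` would time one million single-character replacements. A warm-up run before measuring is usually worthwhile so that JIT compilation is not counted.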

input = "shmB6BhLe0gdGU8OxYykZ21vuxLjBo5I1ZTJjxWfyRTTlqQlgz0yUtPu8iNCCcsx78EPsObiPkCpRT8nqRtvM3Bku1f9nStmigaw";
var inputArray = input.ToCharArray();
for (int i = 0; i < inputArray.Length; i++)
{
    switch (inputArray[i])
    {
        case '\u2013': // en dash
        case '\u2014': // em dash
        case '\u2015': // horizontal bar
            inputArray[i] = '-';
            break;
        case '\u2017': // double low line
            inputArray[i] = '_';
            break;
        case '\u2018': // left single quotation mark
        case '\u2019': // right single quotation mark
        case '\u201b': // single high-reversed-9 quotation mark
        case '\u2032': // prime
            inputArray[i] = '\'';
            break;
        case '\u201a': // single low-9 quotation mark
            inputArray[i] = ',';
            break;
        case '\u201c': // left double quotation mark
        case '\u201d': // right double quotation mark
        case '\u201e': // double low-9 quotation mark
        case '\u2033': // double prime
            inputArray[i] = '\"';
            break;
        case '\u2026': // horizontal ellipsis
            // A char array cannot grow, so the ellipsis becomes a single '.' here, not "...".
            inputArray[i] = '.';
            break;
    }
}
input = new string(inputArray);

Time: 362.0858 milliseconds
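The char-array version can only swap one character for another, so it turns the ellipsis into a single period. If the three-dot expansion matters, a single pass over the string with a `StringBuilder` keeps the one-loop structure while allowing the output to grow. This is a sketch only: the names `TypographyNormalizer` and `ToAscii` are made up for illustration, and it has not been timed against the two variants above.

```csharp
using System.Text;

public static class TypographyNormalizer
{
    // Single pass over the input; each typographic character is mapped to the same
    // ASCII replacement used in the answers above, and everything else is copied unchanged.
    public static string ToAscii(string input)
    {
        if (string.IsNullOrEmpty(input)) return input;

        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            switch (c)
            {
                case '\u2013': // en dash
                case '\u2014': // em dash
                case '\u2015': // horizontal bar
                    sb.Append('-');
                    break;
                case '\u2017': // double low line
                    sb.Append('_');
                    break;
                case '\u2018': // left single quotation mark
                case '\u2019': // right single quotation mark
                case '\u201B': // single high-reversed-9 quotation mark
                case '\u2032': // prime
                    sb.Append('\'');
                    break;
                case '\u201A': // single low-9 quotation mark
                    sb.Append(',');
                    break;
                case '\u201C': // left double quotation mark
                case '\u201D': // right double quotation mark
                case '\u201E': // double low-9 quotation mark
                case '\u2033': // double prime
                    sb.Append('"');
                    break;
                case '\u2026': // horizontal ellipsis expands to three characters
                    sb.Append("...");
                    break;
                default:
                    sb.Append(c);
                    break;
            }
        }
        return sb.ToString();
    }
}
```

Calling `TypographyNormalizer.ToAscii(text)` returns a new string and leaves `text` untouched; null and empty inputs are returned as they are.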

▲ chillily
#7 · 2019-01-16 19:31

A more extensive listing of problematic Word characters:

if (buffer.IndexOf('\u2013') > -1) buffer = buffer.Replace('\u2013', '-');
if (buffer.IndexOf('\u2014') > -1) buffer = buffer.Replace('\u2014', '-');
if (buffer.IndexOf('\u2015') > -1) buffer = buffer.Replace('\u2015', '-');
if (buffer.IndexOf('\u2017') > -1) buffer = buffer.Replace('\u2017', '_');
if (buffer.IndexOf('\u2018') > -1) buffer = buffer.Replace('\u2018', '\'');
if (buffer.IndexOf('\u2019') > -1) buffer = buffer.Replace('\u2019', '\'');
if (buffer.IndexOf('\u201a') > -1) buffer = buffer.Replace('\u201a', ',');
if (buffer.IndexOf('\u201b') > -1) buffer = buffer.Replace('\u201b', '\'');
if (buffer.IndexOf('\u201c') > -1) buffer = buffer.Replace('\u201c', '\"');
if (buffer.IndexOf('\u201d') > -1) buffer = buffer.Replace('\u201d', '\"');
if (buffer.IndexOf('\u201e') > -1) buffer = buffer.Replace('\u201e', '\"');
if (buffer.IndexOf('\u2026') > -1) buffer = buffer.Replace("\u2026", "...");
if (buffer.IndexOf('\u2032') > -1) buffer = buffer.Replace('\u2032', '\'');
if (buffer.IndexOf('\u2033') > -1) buffer = buffer.Replace('\u2033', '\"');