How do I convert Word smart quotes and em dashes i

Posted 2019-01-06 13:13

I have a form with a textarea. Users enter a block of text which is stored in a database.

Occasionally a user pastes text from Word containing smart quotes or em dashes. Those characters appear in the database as: –, ’, “, â€

What function should I call on the input string to convert smart quotes to regular quotes and em dashes to regular dashes?

I am working in PHP.

Update: Thanks for all of the great responses so far. The page on Joel's site about encodings is very informative: http://www.joelonsoftware.com/articles/Unicode.html

Some notes on my environment:

The MySQL database uses UTF-8 encoding. Likewise, the HTML pages that display the content use UTF-8 (update: set explicitly through the meta content-type tag).

On those pages the smart quotes and em dashes appear as a diamond with a question mark.

Solution:

Thanks again for the responses. The solution was twofold:

  1. Make sure the database and HTML files were explicitly set to use UTF-8 encoding.
  2. Use htmlspecialchars() instead of htmlentities().
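
Point 2 can be illustrated with a minimal sketch, assuming the input is already valid UTF-8 and PHP 7.0 or later (for the "\u{...}" escapes). The sample text and variable names are illustrative, not from the original post. htmlspecialchars() escapes only the characters that are special in HTML, so curly quotes and dashes pass through untouched; htmlentities() rewrites them as named entities, and if its encoding argument does not match the data (the default was ISO-8859-1 before PHP 5.4) multibyte characters get mangled.

```php
<?php
// Illustrative input: UTF-8 text with curly quotes and an em dash,
// written with escape sequences so the file's own encoding does not matter.
$text = "\u{201C}It\u{2019}s nice\u{201D} \u{2014} <b>bold</b>";

// Escape only &, <, > and quotes; state the encoding explicitly.
$safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// For comparison: htmlentities() also rewrites the typographic characters.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $safe, "\n";
echo $entities, "\n";
```

With a UTF-8 page and a UTF-8 database connection, the first form is usually enough, because the browser can render the typographic characters directly.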

13 answers
小情绪 Triste *
#2 · 2019-01-06 13:34

You have to manually change the collation of individual columns to UTF8; changing the database overall won't alter these.

趁早两清
#3 · 2019-01-06 13:35

It sounds like the real problem is that your database is not using the same character encoding as your page (which should probably be UTF-8). In that case, if any user submits a non-ASCII character you'll probably see weird characters in the database. Finding and fixing just a few of them (curly quotes and em dashes) isn't going to solve the real problem.

Here is some info on migrating your database to another character encoding, at least for a MySQL database.

孤傲高冷的网名
#4 · 2019-01-06 13:36

This sounds like a Unicode issue. Joel Spolsky has a good jumping off point on the topic: http://www.joelonsoftware.com/articles/Unicode.html

别忘想泡老子
#5 · 2019-01-06 13:36

This is an unfortunately all-too-common problem, not helped by PHP's very poor handling of character sets.

What we do is force the text through iconv:

// Convert input data to UTF8, ignore any odd (MS Word..) chars
// that don't translate
$input = iconv("ISO-8859-1","UTF-8//IGNORE",$input);

The //IGNORE suffix means that anything that can't be translated is thrown away. As the PHP manual puts it: "If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded."
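
Word's smart quotes sit in the 0x80 to 0x9F byte range of Windows-1252, which ISO-8859-1 assigns to control characters. As a hedged variant of the call above, assuming the submitted bytes really are Windows-1252 (an assumption about the input, not something this answer states), naming that charset as the source lets iconv map them to the matching Unicode characters. The sample string is illustrative.

```php
<?php
// Assumed input: the bytes 0x93, 0x94 and 0x97 are the Windows-1252
// left double quote, right double quote and em dash.
$input = "\x93Hello\x94 \x97 world";

// Convert from Windows-1252 to UTF-8. These bytes all have Unicode
// equivalents, so no //IGNORE suffix is needed for this input.
$utf8 = iconv('Windows-1252', 'UTF-8', $input);

echo $utf8, "\n";
```

If the form page is already served as UTF-8, the browser normally submits UTF-8 and no conversion is needed; converting again would double-encode the text.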

#6 · 2019-01-06 13:38

If you were looking to escape these characters for the web while preserving their appearance, so your strings will appear like this: “It’s nice!” rather than "It's boring"...

You can do this by using your own custom htmlEncode function in place of PHP's htmlentities():

$trans_tbl = false;

function htmlEncode($text) {

  global $trans_tbl;

  // create translation table once
  if(!$trans_tbl) {
    // start with the default set of conversions and add more.
    // The chr() keys below are single Windows-1252 bytes, so request a
    // single-byte table explicitly (PHP 5.4+ defaults to UTF-8).

    $trans_tbl = get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'ISO-8859-1');

    $trans_tbl[chr(130)] = '&sbquo;';    // Single Low-9 Quotation Mark
    $trans_tbl[chr(131)] = '&fnof;';     // Latin Small Letter F With Hook
    $trans_tbl[chr(132)] = '&bdquo;';    // Double Low-9 Quotation Mark
    $trans_tbl[chr(133)] = '&hellip;';   // Horizontal Ellipsis
    $trans_tbl[chr(134)] = '&dagger;';   // Dagger
    $trans_tbl[chr(135)] = '&Dagger;';   // Double Dagger
    $trans_tbl[chr(136)] = '&circ;';     // Modifier Letter Circumflex Accent
    $trans_tbl[chr(137)] = '&permil;';   // Per Mille Sign
    $trans_tbl[chr(138)] = '&Scaron;';   // Latin Capital Letter S With Caron
    $trans_tbl[chr(139)] = '&lsaquo;';   // Single Left-Pointing Angle Quotation Mark
    $trans_tbl[chr(140)] = '&OElig;';    // Latin Capital Ligature OE

    // smart single/ double quotes (from MS)
    $trans_tbl[chr(145)] = '&lsquo;';
    $trans_tbl[chr(146)] = '&rsquo;';
    $trans_tbl[chr(147)] = '&ldquo;';
    $trans_tbl[chr(148)] = '&rdquo;';

    $trans_tbl[chr(149)] = '&bull;';     // Bullet
    $trans_tbl[chr(150)] = '&ndash;';    // En Dash
    $trans_tbl[chr(151)] = '&mdash;';    // Em Dash
    $trans_tbl[chr(152)] = '&tilde;';    // Small Tilde
    $trans_tbl[chr(153)] = '&trade;';    // Trade Mark Sign
    $trans_tbl[chr(154)] = '&scaron;';   // Latin Small Letter S With Caron
    $trans_tbl[chr(155)] = '&rsaquo;';   // Single Right-Pointing Angle Quotation Mark
    $trans_tbl[chr(156)] = '&oelig;';    // Latin Small Ligature OE
    $trans_tbl[chr(159)] = '&Yuml;';     // Latin Capital Letter Y With Diaeresis

    ksort($trans_tbl);
  }

  // escape HTML
  return strtr($text, $trans_tbl);
}
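
For the opposite goal, the one the question literally asks about (flattening smart punctuation to plain ASCII), a minimal sketch could look like the following. It assumes the text is already valid UTF-8 and PHP 7.0 or later; asciiPunctuation() is a hypothetical helper name, not a built-in and not part of this answer.

```php
<?php
// Hypothetical helper (not from the thread): replace common "smart"
// punctuation in a UTF-8 string with plain ASCII equivalents.
function asciiPunctuation(string $text): string
{
    $map = [
        "\u{2018}" => "'",   // left single quotation mark
        "\u{2019}" => "'",   // right single quotation mark
        "\u{201C}" => '"',   // left double quotation mark
        "\u{201D}" => '"',   // right double quotation mark
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '-',   // em dash
        "\u{2026}" => '...', // horizontal ellipsis
    ];
    // strtr() with an array replaces every key in a single pass.
    return strtr($text, $map);
}

$plain = asciiPunctuation("\u{201C}It\u{2019}s nice\u{201D} \u{2014} really\u{2026}");
echo $plain, "\n";
```

This loses typographic detail, so it only makes sense where the destination genuinely cannot hold UTF-8.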
一纸荒年 Trace。
#7 · 2019-01-06 13:43

The problem is the MySQL connection charset. I fixed my issue with this line of code:

mysql_set_charset('utf8',$link); 
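
The mysql_* functions were removed in PHP 7.0, so on current PHP the same idea has to go through mysqli or PDO. The sketch below uses mysqli; connectUtf8() and its parameters are hypothetical names chosen for illustration, and the function is only defined here, not called, since it needs a live server.

```php
<?php
// Hypothetical helper (not from the thread): open a mysqli connection
// and set the connection charset right away, before any query runs.
function connectUtf8(string $host, string $user, string $pass, string $db): mysqli
{
    $link = new mysqli($host, $user, $pass, $db);

    // Same charset name as in the answer above; 'utf8mb4' is the usual
    // choice on servers and tables that support it.
    $link->set_charset('utf8');

    return $link;
}
```

Setting the charset through the API, rather than with a SET NAMES query, also keeps the client library's escaping functions aware of the encoding.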