Considering you only want to replace a few specific and well identified characters, I would go for str_replace
with an array: you obviously don't need the heavy artillery regex will bring you ;-)
And if you encounter some other special characters (damn copy-paste from Microsoft Word...), you can just add them to that array whenever is necessary / whenever they are identified.
The best answer I can give to your comment is probably this link: Convert Smart Quotes with PHP
And the associated code (quoting that page):
function convert_smart_quotes($string)
{
$search = array(chr(145),
chr(146),
chr(147),
chr(148),
chr(151));
$replace = array("'",
"'",
'"',
'"',
'-');
return str_replace($search, $replace, $string);
}
(I don't have Microsoft Word on this computer, so I can't test by myself)
I don't remember exactly what we used at work (I was not the one having to deal with that kind of input), but it was the same kind of stuff...
Your Microsoft-encoded quotes are the probably the typographic quotation marks. You can simply replace them with str_replace
if you know the encoding of the string in that you want to replace them.
Here’s an example for UTF-8 but using a single mapping array with strtr
:
$quotes = array(
"\xC2\xAB" => '"', // « (U+00AB) in UTF-8
"\xC2\xBB" => '"', // » (U+00BB) in UTF-8
"\xE2\x80\x98" => "'", // ‘ (U+2018) in UTF-8
"\xE2\x80\x99" => "'", // ’ (U+2019) in UTF-8
"\xE2\x80\x9A" => "'", // ‚ (U+201A) in UTF-8
"\xE2\x80\x9B" => "'", // ‛ (U+201B) in UTF-8
"\xE2\x80\x9C" => '"', // “ (U+201C) in UTF-8
"\xE2\x80\x9D" => '"', // ” (U+201D) in UTF-8
"\xE2\x80\x9E" => '"', // „ (U+201E) in UTF-8
"\xE2\x80\x9F" => '"', // ‟ (U+201F) in UTF-8
"\xE2\x80\xB9" => "'", // ‹ (U+2039) in UTF-8
"\xE2\x80\xBA" => "'", // › (U+203A) in UTF-8
);
$str = strtr($str, $quotes);
If you’re need another encoding, you can use mb_convert_encoding
to convert the keys.
If like me you arrive here with an enormous range of broken ASCII / Microsoft Word characters that are doing weird things to your CMS or RTE and iconv isn't working, then this mad function might just be for you.
Make sure your encoding is UTF-8 when you save this function to a file.
<?php
/**
* fixMSWord
*
* Replace ASCII chars with UTF-8. Note there are ASCII characters that don't
* correctly map and will be replaced by spaces.
*
* @author Robin Cafolla
* @date 2013-03-22
*/
function fixMSWord($string) {
$map = Array(
'33' => '!', '34' => '"', '35' => '#', '36' => '$', '37' => '%', '38' => '&', '39' => "'", '40' => '(', '41' => ')', '42' => '*',
'43' => '+', '44' => ',', '45' => '-', '46' => '.', '47' => '/', '48' => '0', '49' => '1', '50' => '2', '51' => '3', '52' => '4',
'53' => '5', '54' => '6', '55' => '7', '56' => '8', '57' => '9', '58' => ':', '59' => ';', '60' => '<', '61' => '=', '62' => '>',
'63' => '?', '64' => '@', '65' => 'A', '66' => 'B', '67' => 'C', '68' => 'D', '69' => 'E', '70' => 'F', '71' => 'G', '72' => 'H',
'73' => 'I', '74' => 'J', '75' => 'K', '76' => 'L', '77' => 'M', '78' => 'N', '79' => 'O', '80' => 'P', '81' => 'Q', '82' => 'R',
'83' => 'S', '84' => 'T', '85' => 'U', '86' => 'V', '87' => 'W', '88' => 'X', '89' => 'Y', '90' => 'Z', '91' => '[', '92' => '\\',
'93' => ']', '94' => '^', '95' => '_', '96' => '`', '97' => 'a', '98' => 'b', '99' => 'c', '100'=> 'd', '101'=> 'e', '102'=> 'f',
'103'=> 'g', '104'=> 'h', '105'=> 'i', '106'=> 'j', '107'=> 'k', '108'=> 'l', '109'=> 'm', '110'=> 'n', '111'=> 'o', '112'=> 'p',
'113'=> 'q', '114'=> 'r', '115'=> 's', '116'=> 't', '117'=> 'u', '118'=> 'v', '119'=> 'w', '120'=> 'x', '121'=> 'y', '122'=> 'z',
'123'=> '{', '124'=> '|', '125'=> '}', '126'=> '~', '127'=> ' ', '128'=> '€', '129'=> ' ', '130'=> ',', '131'=> ' ', '132'=> '"',
'133'=> '.', '134'=> ' ', '135'=> ' ', '136'=> '^', '137'=> ' ', '138'=> ' ', '139'=> '<', '140'=> ' ', '141'=> ' ', '142'=> ' ',
'143'=> ' ', '144'=> ' ', '145'=> "'", '146'=> "'", '147'=> '"', '148'=> '"', '149'=> '.', '150'=> '-', '151'=> '-', '152'=> '~',
'153'=> ' ', '154'=> ' ', '155'=> '>', '156'=> ' ', '157'=> ' ', '158'=> ' ', '159'=> ' ', '160'=> ' ', '161'=> '¡', '162'=> '¢',
'163'=> '£', '164'=> '¤', '165'=> '¥', '166'=> '¦', '167'=> '§', '168'=> '¨', '169'=> '©', '170'=> 'ª', '171'=> '«', '172'=> '¬',
'173'=> '', '174'=> '®', '175'=> '¯', '176'=> '°', '177'=> '±', '178'=> '²', '179'=> '³', '180'=> '´', '181'=> 'µ', '182'=> '¶',
'183'=> '·', '184'=> '¸', '185'=> '¹', '186'=> 'º', '187'=> '»', '188'=> '¼', '189'=> '½', '190'=> '¾', '191'=> '¿', '192'=> 'À',
'193'=> 'Á', '194'=> 'Â', '195'=> 'Ã', '196'=> 'Ä', '197'=> 'Å', '198'=> 'Æ', '199'=> 'Ç', '200'=> 'È', '201'=> 'É', '202'=> 'Ê',
'203'=> 'Ë', '204'=> 'Ì', '205'=> 'Í', '206'=> 'Î', '207'=> 'Ï', '208'=> 'Ð', '209'=> 'Ñ', '210'=> 'Ò', '211'=> 'Ó', '212'=> 'Ô',
'213'=> 'Õ', '214'=> 'Ö', '215'=> '×', '216'=> 'Ø', '217'=> 'Ù', '218'=> 'Ú', '219'=> 'Û', '220'=> 'Ü', '221'=> 'Ý', '222'=> 'Þ',
'223'=> 'ß', '224'=> 'à', '225'=> 'á', '226'=> 'â', '227'=> 'ã', '228'=> 'ä', '229'=> 'å', '230'=> 'æ', '231'=> 'ç', '232'=> 'è',
'233'=> 'é', '234'=> 'ê', '235'=> 'ë', '236'=> 'ì', '237'=> 'í', '238'=> 'î', '239'=> 'ï', '240'=> 'ð', '241'=> 'ñ', '242'=> 'ò',
'243'=> 'ó', '244'=> 'ô', '245'=> 'õ', '246'=> 'ö', '247'=> '÷', '248'=> 'ø', '249'=> 'ù', '250'=> 'ú', '251'=> 'û', '252'=> 'ü',
'253'=> 'ý', '254'=> 'þ', '255'=> 'ÿ'
);
$search = Array();
$replace = Array();
foreach ($map as $s => $r) {
$search[] = chr((int)$s);
$replace[] = $r;
}
return str_replace($search, $replace, $string);
}
We used the following. It deals with a few more special characters.
$text = str_replace(chr(130), ',', $text); // Baseline single quote
$text = str_replace(chr(132), '"', $text); // Baseline double quote
$text = str_replace(chr(133), '...', $text); // Ellipsis
$text = str_replace(chr(145), "'", $text); // Left single quote
$text = str_replace(chr(146), "'", $text); // Right single quote
$text = str_replace(chr(147), '"', $text); // Left double quote
$text = str_replace(chr(148), '"', $text); // Right double quote
$text = mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8');
Every single one of the previous answers except for Gumbo's will mangle Unicode strings:
echo convert_smart_quotes("This is Yi: ꑑ. Point ⒒ this breaks Yi. Yi broke–why? I need a longer––point. This makes Han 嗗 mad.");
Results in:
This is Yi: ?''. Point ?'' this breaks Yi. Yi broke?"why? I need a longer?"?"point. This makes Han ?-- mad.
The iconv:
$output = iconv('UTF-8', 'ASCII//TRANSLIT', $input);
Results in:
PHP Notice: iconv(): Detected an illegal character in input string in php shell code on line 1
You can change it to //IGNORE
, which will remove the characters, but not translate them.
This is the best way to replace Microsoft quotes encoded in CP1252. If they are in Unicode and you need to replace them, use Gumbo's answer:
function convert_cp1252_to_ascii($input, $default = '') {
if ($input === null || $input == '') {
return $default;
}
// https://en.wikipedia.org/wiki/UTF-8
// https://en.wikipedia.org/wiki/ISO/IEC_8859-1
// https://en.wikipedia.org/wiki/Windows-1252
// http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
$encoding = mb_detect_encoding($input, array('Windows-1252', 'ISO-8859-1'), true);
if ($encoding == 'ISO-8859-1' || $encoding == 'Windows-1252') {
/*
* Use the search/replace arrays if a character needs to be replaced with
* something other than its Unicode equivalent.
*/
$replace = array(
128 => "E", // http://www.fileformat.info/info/unicode/char/20AC/index.htm EURO SIGN
129 => "", // UNDEFINED
130 => ",", // http://www.fileformat.info/info/unicode/char/201A/index.htm SINGLE LOW-9 QUOTATION MARK
131 => "f", // http://www.fileformat.info/info/unicode/char/0192/index.htm LATIN SMALL LETTER F WITH HOOK
132 => ",,", // http://www.fileformat.info/info/unicode/char/201e/index.htm DOUBLE LOW-9 QUOTATION MARK
133 => "...", // http://www.fileformat.info/info/unicode/char/2026/index.htm HORIZONTAL ELLIPSIS
134 => "t", // http://www.fileformat.info/info/unicode/char/2020/index.htm DAGGER
135 => "T", // http://www.fileformat.info/info/unicode/char/2021/index.htm DOUBLE DAGGER
136 => "^", // http://www.fileformat.info/info/unicode/char/02c6/index.htm MODIFIER LETTER CIRCUMFLEX ACCENT
137 => "%", // http://www.fileformat.info/info/unicode/char/2030/index.htm PER MILLE SIGN
138 => "S", // http://www.fileformat.info/info/unicode/char/0160/index.htm LATIN CAPITAL LETTER S WITH CARON
139 => "<", // http://www.fileformat.info/info/unicode/char/2039/index.htm SINGLE LEFT-POINTING ANGLE QUOTATION MARK
140 => "OE", // http://www.fileformat.info/info/unicode/char/0152/index.htm LATIN CAPITAL LIGATURE OE
141 => "", // UNDEFINED
142 => "Z", // http://www.fileformat.info/info/unicode/char/017d/index.htm LATIN CAPITAL LETTER Z WITH CARON
143 => "", // UNDEFINED
144 => "", // UNDEFINED
145 => "'", // http://www.fileformat.info/info/unicode/char/2018/index.htm LEFT SINGLE QUOTATION MARK
146 => "'", // http://www.fileformat.info/info/unicode/char/2019/index.htm RIGHT SINGLE QUOTATION MARK
147 => "\"", // http://www.fileformat.info/info/unicode/char/201c/index.htm LEFT DOUBLE QUOTATION MARK
148 => "\"", // http://www.fileformat.info/info/unicode/char/201d/index.htm RIGHT DOUBLE QUOTATION MARK
149 => "*", // http://www.fileformat.info/info/unicode/char/2022/index.htm BULLET
150 => "-", // http://www.fileformat.info/info/unicode/char/2013/index.htm EN DASH
151 => "--", // http://www.fileformat.info/info/unicode/char/2014/index.htm EM DASH
152 => "~", // http://www.fileformat.info/info/unicode/char/02DC/index.htm SMALL TILDE
153 => "TM", // http://www.fileformat.info/info/unicode/char/2122/index.htm TRADE MARK SIGN
154 => "s", // http://www.fileformat.info/info/unicode/char/0161/index.htm LATIN SMALL LETTER S WITH CARON
155 => ">", // http://www.fileformat.info/info/unicode/char/203A/index.htm SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
156 => "oe", // http://www.fileformat.info/info/unicode/char/0153/index.htm LATIN SMALL LIGATURE OE
157 => "", // UNDEFINED
158 => "z", // http://www.fileformat.info/info/unicode/char/017E/index.htm LATIN SMALL LETTER Z WITH CARON
159 => "Y", // http://www.fileformat.info/info/unicode/char/0178/index.htm LATIN CAPITAL LETTER Y WITH DIAERESIS
);
$find = array();
foreach (array_keys($replace) as $key) {
$find[] = chr($key);
}
$input = str_replace($find, array_values($replace), $input);
/*
* Because ISO-8859-1 and CP1252 are identical except for 0x80 through 0x9F
* and control characters, always convert from Windows-1252 to UTF-8.
*/
$input = iconv('Windows-1252', 'UTF-8//IGNORE', $input);
}
return $input;
}
Taken from this answer, with some modifications. If you want to control over what you find/replace, use that function.