I have a form with a textarea. Users enter a block of text which is stored in a database.
Occasionally a user will paste text from Word containing smart quotes or emdashes. Those characters appear in the database as: –, ’, “ ,â€
What function should I call on the input string to convert smart quotes to regular quotes and emdashes to regular dashes?
I am working in PHP.
Update: Thanks for all of the great responses so far. The page on Joel's site about encodings is very informative: http://www.joelonsoftware.com/articles/Unicode.html
Some notes on my environment:
The MySQL database is using UTF-8 encoding. Likewise, the HTML pages that display the content are using UTF-8 (Update:) by explicitly setting the meta content-type.
On those pages the smart quotes and emdashes appear as a diamond with question mark.
Solution:
Thanks again for the responses. The solution was twofold:
- Make sure the database and HTML files were explicitly set to use UTF-8 encoding.
- Use
htmlspecialchars()
instead ofhtmlentities()
.
You have to manually change the collation of individual columns to UTF8; changing the database overall won't alter these.
It sounds like the real problem is that your database is not using the same character encoding as your page (which should probably be UTF-8). In that case, if any user submits a non-ASCII character you'll probably see weird characters in the database. Finding and fixing just a few of them (curly quotes and em dashes) isn't going to solve the real problem.
Here is some info on migrating your database to another character encoding, at least for a MySQL database.
This sounds like a Unicode issue. Joel Spolsky has a good jumping off point on the topic: http://www.joelonsoftware.com/articles/Unicode.html
This is an unfortunately all-too-common problem, not helped by PHP's very poor handling of character sets.
What we do is force the text through
iconv
The
//IGNORE
flag means that anything that can't be translated will be thrown away.If you were looking to escape these characters for the web while preserving their appearance, so your strings will appear like this: “It’s nice!” rather than "It's boring"...
You can do this by using your own custom htmlEncode function in place of PHP's htmlentities():
the problem is on the mysql charset, I fixed my issues with this line of code.