I have a set of Word documents which I want to publish using a PHP tool I've written. I copy and paste the Word documents into a text box and then save them into MySQL using the PHP program. The problem I have arises from all the non-standard characters that Word documents contain, like curly quotes and ellipses ("..."). What I do at the moment is manually search and replace these kinds of things (and also foreign symbols such as e-acute) with either plain text or HTML entities (&eacute; etc.). Is there a function in PHP I can call that will take the output of a Word document and convert everything that should be entities into entities, and other symbols that don't display properly in Firefox into symbols that do display?
Thanks!
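htmlspecialchars() will get you a long way, but watch out because Word documents are messy: it only escapes &, <, > and quotes, so characters such as curly quotes and e-acute pass through untouched.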
I think that all these answers miss one vital point. Windows itself uses a Windows flavour of Latin-1 (Windows-1252), so if you paste some special characters (like asymmetrical quotes) into a form on a Windows machine and that gets sent to a Unix (or anything non-muckrosoft) box (be that to a database or whatever), some of the characters do not get matched to anything the Unix system comprehends, hence the confused and garbled characters. What this means is that even if you have a UTF-8 database and use htmlentities, some nasties are still going to get through, because the bytes Windows uses for them (0x80 to 0x9F) are not printable characters in real Latin-1 and are not valid UTF-8 on their own; those byte assignments are Microsoft-only inventions. I would love to know of a slick solution. What I do is keep a manual list of the character codes of the Microsoft-only chars I have encountered, alongside an (also manual) list of the UTF-8 characters to replace them with, do a str_replace for all of these, and THEN you can do whatever you want with them: iconv, htmlentities, save straight into a UTF-8 database, it matters not anymore.
My grasp of all this is a little shaky. Check out http://www.cs.tut.fi/~jkorpela/www/windows-chars.html for an excellent explanation, which I have mutilated into short form above. If someone has a better solution (surely there is one out there!) for how to PHPify what this article explains, I would love to hear it!
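For what it's worth, the manual-list approach described above can be sketched roughly like this. It assumes the incoming string really is single-byte Windows-1252 (not already UTF-8); the function name `cp1252_to_utf8` and the choice of which characters to list are mine, and the table is deliberately partial. It uses strtr() with an array rather than str_replace() so that everything is replaced in one pass and already-replaced text is never rescanned.

```php
<?php
// Hypothetical helper: converts a single-byte Windows-1252 string to UTF-8
// using a hand-maintained table. Assumes the input is NOT already UTF-8.
function cp1252_to_utf8(string $text): string
{
    // Partial table: Windows-1252 bytes that Word output commonly contains.
    $map = [
        "\x80" => "\u{20AC}", // euro sign
        "\x85" => "\u{2026}", // horizontal ellipsis
        "\x91" => "\u{2018}", // left single quote
        "\x92" => "\u{2019}", // right single quote
        "\x93" => "\u{201C}", // left double quote
        "\x94" => "\u{201D}", // right double quote
        "\x95" => "\u{2022}", // bullet
        "\x96" => "\u{2013}", // en dash
        "\x97" => "\u{2014}", // em dash
        "\x99" => "\u{2122}", // trade mark sign
    ];

    // Bytes 0xA0-0xFF mean the same as in Latin-1: the code point equals the
    // byte value, which encodes as a two-byte UTF-8 sequence.
    for ($b = 0xA0; $b <= 0xFF; $b++) {
        $map[chr($b)] = chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
    }

    // strtr() with an array replaces in a single pass; bytes in 0x80-0x9F
    // that are missing from the table are left untouched.
    return strtr($text, $map);
}
```

For example, `cp1252_to_utf8("\x93caf\xE9\x94")` gives the UTF-8 string “café”. If the iconv or mbstring extension is available, converting the whole string from Windows-1252 in one call is probably the less error-prone route.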
This has served me well in the past:
A better solution would be to ensure that your database is set up to support UTF-8 characters. The additional characters available in the extended set should cover all the "non-standard" characters that you're talking about.
Otherwise, if you really must convert these characters into HTML entities, use htmlentities().
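As a rough illustration of the htmlentities() route, here is a minimal sketch. It assumes the text is already valid UTF-8 by the time it reaches PHP; the wrapper name `to_entities` is made up for the example. Passing the charset explicitly matters, because htmlentities() has to know the input encoding to recognise multi-byte characters.

```php
<?php
// Hypothetical wrapper: converts every character that has a named HTML 4.01
// entity into that entity. Assumes $text is valid UTF-8.
function to_entities(string $text): string
{
    // ENT_QUOTES also escapes single quotes; ENT_HTML401 selects the
    // HTML 4.01 entity table; 'UTF-8' declares the input encoding.
    return htmlentities($text, ENT_QUOTES | ENT_HTML401, 'UTF-8');
}

echo to_entities("\u{201C}caf\u{E9}\u{201D}"), "\n";
// prints &ldquo;caf&eacute;&rdquo;
```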
Here's a solution I cooked up for the problem with the non-portable windows character set. This replaces the offending almost-Latin-1 characters with their equivalent HTML entities.
It Works For Me™
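A minimal sketch of this kind of replacement might look like the following. It is an illustration rather than the code originally posted with this answer: the function name `cp1252_specials_to_entities` is hypothetical, and it assumes single-byte Windows-1252 input. Only the bytes 0x80 to 0x9F are touched; ordinary Latin-1 bytes such as 0xE9 are left alone for a later step.

```php
<?php
// Hypothetical helper: replaces the Windows-1252 bytes in 0x80-0x9F with
// numeric HTML entities. Assumes single-byte (not UTF-8) input.
function cp1252_specials_to_entities(string $text): string
{
    // Windows-1252 byte => Unicode code point. The five bytes that
    // Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are absent.
    $codePoints = [
        0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
        0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
        0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
        0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
        0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
        0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
        0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
    ];

    // Build a byte => "&#NNNN;" table (decimal numeric entities).
    $map = [];
    foreach ($codePoints as $byte => $codePoint) {
        $map[chr($byte)] = '&#' . $codePoint . ';';
    }

    return strtr($text, $map);
}
```

With this, `cp1252_specials_to_entities("\x93Hi\x94 \x85")` returns `&#8220;Hi&#8221; &#8230;`, which is plain ASCII and therefore survives any later charset conversion.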