Converting a Word document into usable HTML in PHP

I have a set of Word documents which I want to publish using a PHP tool I've written. I copy and paste the Word documents into a text box and then save them into MySQL using the PHP program. The problem I have arises from all the non-standard characters that Word documents contain, like curly quotes and ellipses ("..."). What I do at the moment is manually search and replace these kinds of things (and also foreign symbols such as e-acute) with either plain text or HTML entities (&eacute; etc.). Is there a function in PHP I can call that will take the output of a Word document and convert everything that should be entities into entities, and other symbols that don't display properly in Firefox into symbols that do display?

Thanks!

Tags: php ms-word

5 answers

We Are One

#2 · 2020-02-10 08:26

htmlspecialchars() will get you a long way, but watch out because Word documents are messy.
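A rough sketch of what that call does, assuming the pasted text has already reached PHP as UTF-8 (the function name and sample string are made up for illustration). Note that htmlspecialchars() only escapes the characters that are special to HTML markup (&, <, > and quotes); curly quotes and accented letters pass through unchanged, so it makes the text safe to embed rather than converting the Word characters themselves:

```php
<?php
// Hypothetical helper: escape markup characters in UTF-8 text.
function escape_for_html(string $text): string
{
    // ENT_QUOTES also escapes single quotes; 'UTF-8' names the input encoding.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Sample input: curly quotes, an ampersand, a tag and an e-acute.
$pasted = "\u{201C}Fish & chips\u{201D} <b>caf\u{E9}</b>";
echo escape_for_html($pasted), "\n";
```

The ampersand and angle brackets come back as entities, while the curly quotes and the e-acute are left as the same UTF-8 characters.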


我命由我不由天

#3 · 2020-02-10 08:29

I think that all these answers miss one vital point. Windows itself uses a Windows flavour of Latin-1 (Windows-1252), so if you paste some special characters (like asymmetrical quotes) into a form on a Windows machine and that gets sent to a Unix (or anything non-muckrosoft) box, be that to a database or whatever, some of the characters do not match anything the Unix system expects, hence the confused and garbled characters.

What this means is that even if you have a UTF-8 database and use htmlentities, some nasties can still get through. They arrive as bytes in the 0x80-0x9F range, which Windows-1252 uses for printable characters but ISO-8859-1 reserves for control codes, and on their own those bytes are not valid UTF-8. The characters themselves do exist in Unicode; it is the byte values that are a Microsoft-only invention. I would love to know of a slick solution. What I do is keep a manual list of the Microsoft-only character codes I have encountered alongside an (also manual) list of the matching UTF-8 characters, do a str_replace for all of these, and THEN you can do whatever you want with them: iconv, htmlentities, save straight into a UTF-8 database, it matters not anymore.

My grasp on all this is a little shaky - check out http://www.cs.tut.fi/~jkorpela/www/windows-chars.html for an excellent explanation, which I have mutilated into short form above. If someone has a better solution (surely there is one out there!) for how to PHPify what this article explains, I would love to hear it!
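One possible way to PHPify this without a hand-made list, as a minimal sketch only: let mbstring do the Windows-1252 to UTF-8 mapping. It assumes the mbstring extension is loaded and that any input which is not valid UTF-8 is really Windows-1252; the function name is hypothetical.

```php
<?php
// Hypothetical helper: normalise pasted text to UTF-8.
function word_text_to_utf8(string $text): string
{
    // Heuristic: if the bytes already form valid UTF-8, leave them alone
    // so that correctly encoded input is not converted a second time.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    // Assumption: anything else is Windows-1252, so bytes such as
    // 0x93 and 0x94 become the Unicode curly quotes.
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

// Sample input: Windows-1252 curly quotes (0x93, 0x94) and ellipsis (0x85).
$raw = "\x93Hello\x94\x85";
echo word_text_to_utf8($raw), "\n";
```

The validity check is only a heuristic: a short Windows-1252 string can, in rare cases, also happen to be valid UTF-8. Making the form page and the database connection both UTF-8 avoids the guesswork.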


神经病院院长

#4 · 2020-02-10 08:38

This has served me well in the past:

$str = mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8');
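One caveat: the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2. A possible substitute, sketched here under the assumption that $str is valid UTF-8 and that numeric entities are acceptable in place of named ones, is mb_encode_numericentity(). The function name and the conversion map below are illustrative choices, not part of the original answer.

```php
<?php
// Hypothetical helper: turn every non-ASCII character into a numeric entity.
function utf8_to_numeric_entities(string $text): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    // This covers everything from U+0080 up to U+10FFFF and leaves ASCII alone.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($text, $map, 'UTF-8');
}

// Sample input: an e-acute and an ellipsis character.
echo utf8_to_numeric_entities("caf\u{E9} \u{2026}"), "\n";
```

Unlike htmlspecialchars(), this does not escape &, < or >, so it only addresses the non-ASCII characters.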


够拽才男人

#5 · 2020-02-10 08:42

A better solution would be to ensure that your database is set up to support UTF-8 characters. The additional characters available in that character set should cover all the "non-standard" characters that you're talking about.

Otherwise, if you really must convert these characters into HTML entities, use htmlentities().
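For illustration, a small sketch of that call, assuming the text is already UTF-8 (the function name and sample string are hypothetical). Passing the encoding explicitly matters: if htmlentities() is told the wrong encoding, multi-byte characters are not converted correctly.

```php
<?php
// Hypothetical helper: convert UTF-8 text to named HTML 4.01 entities.
function utf8_to_named_entities(string $text): string
{
    // ENT_QUOTES also converts single quotes; ENT_HTML401 selects the
    // HTML 4.01 entity table (names like &eacute; and &hellip;).
    return htmlentities($text, ENT_QUOTES | ENT_HTML401, 'UTF-8');
}

// Sample input: e-acute, curly double quotes and an ellipsis character.
echo utf8_to_named_entities("caf\u{E9} \u{201C}quoted\u{201D}\u{2026}"), "\n";
```

The HTML 4.01 table is chosen here on purpose: with ENT_HTML5, PHP also converts ordinary punctuation such as "." into entities, which is rarely what you want for body text.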


对你真心纯属浪费

#6 · 2020-02-10 08:42

Here's a solution I cooked up for the problem with the non-portable Windows character set. This replaces the offending almost-Latin-1 characters with their equivalent HTML entities.

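// The function name is a hypothetical wrapper; the original snippet showed only the body.
// Assumes $input is single-byte Windows-1252 text. Do not run this on UTF-8
// input: bytes 0x80-0x9F also occur inside UTF-8 multi-byte sequences.
function windows_chars_to_entities($input)
{
    $translation = array(
        // reference from http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
        "\x80" => "&#8364;", // euro sign (added; not in the original list)
        "\x82" => "&#8218;",
        "\x83" => "&#402;",
        "\x84" => "&#8222;",
        "\x85" => "&#8230;", // ellipsis
        "\x86" => "&#8224;",
        "\x87" => "&#8225;",
        "\x88" => "&#710;",
        "\x89" => "&#8240;",
        "\x8a" => "&#352;",
        "\x8b" => "&#8249;",
        "\x8c" => "&#338;",
        "\x8e" => "&#381;",  // Z with caron (added; not in the original list)
        "\x91" => "&#8216;", // left single quote
        "\x92" => "&#8217;", // right single quote
        "\x93" => "&#8220;", // left double quote
        "\x94" => "&#8221;", // right double quote
        "\x95" => "&#8226;",
        "\x96" => "&#8211;", // en dash
        "\x97" => "&#8212;", // em dash
        "\x98" => "&#732;",
        "\x99" => "&#8482;",
        "\x9a" => "&#353;",
        "\x9b" => "&#8250;",
        "\x9c" => "&#339;",
        "\x9e" => "&#382;",  // z with caron (added; not in the original list)
        "\x9f" => "&#376;",
    );

    return str_replace(array_keys($translation), array_values($translation), $input);
}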

It Works For Me™
