Question:

I have a site where users can post stuff (as in forums, comments, etc) using a customised implementation of TinyMCE. A lot of them like to copy & paste from Word, which means their input often comes with a plethora of associated MS inline formatting.

I can't just get rid of <span whatever> as TinyMCE relies on the span tag for some of its formatting, and I can't (and don't want to) force said users to use TinyMCE's "Paste From Word" feature (which doesn't seem to work that well anyway).

Anyone know of a library/class/function that would take care of this for me? It must be a common problem, though I can't find anything definitive. I've been thinking recently that a series of brute-force regexes looking for MS-specific patterns might do the trick, but I don't want to re-write something that may already be available unless I must.

Also, fixing of curly quotes, em-dashes, etc would be good. I have my own stuff to do this now, but I'd really just like to find one MS-conversion filter to rule them all.

Answer 1:

HTML Purifier will create standards-compliant markup and filter out many possible attacks (such as XSS).

For faster cleanups that don't require XSS filtering, I use the PECL extension Tidy, which is a binding for the Tidy HTML utility.

If those don't help you, I suggest you switch to FCKEditor, which has this feature built in.
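
For the Tidy route, a minimal sketch might look like the following. It assumes the tidy extension is loaded and that the input is UTF-8; the function name cleanWithTidy() is made up for this example, and the option set (word-2000, bare, show-body-only) is just one plausible starting point rather than a recommended configuration. It does no XSS filtering.

```php
<?php
// Sketch only: run pasted HTML through Tidy with its Word-oriented options.
// cleanWithTidy() is a hypothetical helper name, not part of any library.
function cleanWithTidy(string $html): string
{
    if (!function_exists('tidy_repair_string')) {
        // Tidy extension not available: hand the input back untouched.
        return $html;
    }

    $config = [
        'word-2000'      => true, // strip the surplus markup Word 2000 inserts
        'bare'           => true, // strip Microsoft-specific HTML, nbsp becomes a space
        'show-body-only' => true, // return a fragment instead of a full document
    ];

    // tidy_repair_string() returns false on failure, so fall back to the input.
    $result = tidy_repair_string($html, $config, 'utf8');

    return is_string($result) ? $result : $html;
}

echo cleanWithTidy('<p class="MsoNormal">Hello</p>'), "\n";
```

How much of the Word markup actually disappears depends on the Tidy version behind the extension, so it is worth trying this on a few real pastes before relying on it.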

Answer 2:

The website http://word2cleanhtml.com/ does a good job of converting from Word. I'm using it from PHP by scraping, to process some legacy HTML, and so far it's working pretty well (the result is very clean <p>, <b> code). Of course, being an external service, it's not a good fit for live processing like in your case.

If you try it and it returns a lot of 400 errors, try filtering the HTML with Tidy first.

Answer 3:

In my case, there was a pattern. The unwanted part always started with

<!-- [if gte mso 9]>

and ended with

<![endif]-->

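So my solution was to keep everything before and after this block and throw away the block itself:

// Keep what precedes the first "<!-" and what follows the last "[endif]-->".
// strrpos() matches the whole marker; strrchr() only looks at its first character.
$marker = '[endif]-->';
$start = strpos($string, '<!-');
$stop = strrpos($string, $marker);
$begin = $start === false ? $string : substr($string, 0, $start);
$end = ($start === false || $stop === false) ? '' : substr($string, $stop + strlen($marker));
echo $begin.$end;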
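
If a paste contains several of these blocks with real content in between, cutting from the first opener to the last [endif] would also throw that content away. A regex that removes each block separately is one way around that. This is only a sketch: stripMsoConditionals() is a hypothetical name, and the pattern assumes the blocks look like the example above (with or without the space after "<!--").

```php
<?php
// Sketch: remove every Word conditional-comment block, wherever it occurs.
// stripMsoConditionals() is a hypothetical helper name.
function stripMsoConditionals(string $html): string
{
    // Matches <!--[if ...]> ... <![endif]--> with optional whitespace after "<!--".
    // "s" lets "." span newlines, "i" ignores case, ".*?" stops at the nearest endif.
    $pattern = '/<!--\s*\[if\b[^\]]*\]>.*?<!\[endif\]\s*-->/si';
    $cleaned = preg_replace($pattern, '', $html);

    // preg_replace() returns null on error (e.g. backtrack limit): keep the input then.
    return $cleaned ?? $html;
}

$input = '<p>Before</p><!-- [if gte mso 9]><xml><w:WordDocument></w:WordDocument></xml><![endif]--><p>After</p>';
echo stripMsoConditionals($input), "\n";
```

Input without any such block comes back unchanged.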

Answer 4:

In my case, this worked just fine:

$text = strip_tags($text, '<p><a><em><span>');

Rather than trying to pull out stuff you don't want, such as embedded Word XML, you can just specify your allowed tags.
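
One thing to keep in mind: strip_tags() leaves the attributes of the allowed tags alone, so class="MsoNormal" and mso-* style declarations survive on the <p> and <span> tags you keep. Below is a rough sketch of follow-up passes for that, plus the curly quotes and dashes from the question. All three function names are hypothetical, the regexes are not a real HTML parser (they would also hit matching text outside tags), input is assumed to be UTF-8, and mapping the punctuation to plain ASCII is just one possible choice. It needs PHP 7 or later.

```php
<?php
// Sketch: clean-up passes to run after strip_tags(). All names are hypothetical.

// Drop class attributes whose value starts with "Mso" (MsoNormal, MsoListParagraph, ...).
// The whole attribute goes, including any other class names listed in it.
function stripMsoClasses(string $html): string
{
    return preg_replace('/\s+class=(["\'])Mso.*?\1/is', '', $html) ?? $html;
}

// Remove mso-* declarations from quoted style attributes, keeping the rest.
function stripMsoStyles(string $html): string
{
    $callback = function (array $m): string {
        // Decode entities first so a value like &quot;Arial&quot; is not split at its ";".
        $value = html_entity_decode($m[2], ENT_QUOTES | ENT_HTML5, 'UTF-8');
        $kept = [];
        foreach (explode(';', $value) as $declaration) {
            $declaration = trim($declaration);
            if ($declaration !== '' && stripos($declaration, 'mso-') !== 0) {
                $kept[] = $declaration;
            }
        }
        if ($kept === []) {
            return ''; // nothing left: drop the whole attribute
        }
        return ' style="' . htmlspecialchars(implode('; ', $kept), ENT_QUOTES, 'UTF-8') . '"';
    };

    return preg_replace_callback('/\s+style=(["\'])(.*?)\1/is', $callback, $html) ?? $html;
}

// Map typographic punctuation to plain ASCII (an assumption about what "fixing" means).
function normalizeSmartPunctuation(string $text): string
{
    return strtr($text, [
        "\u{2018}" => "'",   // left single quote
        "\u{2019}" => "'",   // right single quote / apostrophe
        "\u{201C}" => '"',   // left double quote
        "\u{201D}" => '"',   // right double quote
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '--',  // em dash
        "\u{2026}" => '...', // ellipsis
    ]);
}

$text = '<p class="MsoNormal" style="mso-margin-top-alt:auto">' . "\u{201C}Hi\u{201D} <b>there</b></p>";
$text = strip_tags($text, '<p><a><em><span>');
$text = normalizeSmartPunctuation(stripMsoStyles(stripMsoClasses($text)));
echo $text, "\n";
```

Since TinyMCE relies on <span> for some formatting, the style pass keeps every declaration that is not mso-specific instead of removing the attribute outright.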

PHP to clean-up pasted Microsoft input
