Best way to generate proper markup for inserting i

Posted 2019-09-15 12:24

Question:

I was assigned to import a large amount of content from the database of a proprietary CMS into a new WordPress installation. I have written a PHP script that retrieves the entries and inserts them with the wp_insert_post() function, but I'm now stuck on a problem.

What I want to do is "filter" my input string, the source content, so that it matches the format WordPress itself produces when content is copy-pasted into the built-in editor. For instance, this is what it should look like:

<strong>UIR e OER</strong>

&nbsp;

Os verbos terminados em <strong>-uir</strong> e <strong>-oer</strong> terão as 2ª e 3ª pessoas do singular do presente do indicativo escritas com <strong>-i-</strong>:

<strong> </strong>

<strong>– tu possuis</strong>

<strong>– ele possui</strong>

<strong>– tu constróis</strong>

...

Now, this is how the original content is retrieved from the source database:

<p>&nbsp;<b style="line-height: 150%; text-align: center;"><span style="font-size:13.5pt;line-height:150%;  font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;mso-fareast-font-family:&quot;Times New Roman&quot;;  mso-fareast-language:PT-BR">UIR e OER</span></b></p>  <p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><span style="font-size:12.0pt;line-height:150%;font-family:&quot;Times New Roman&quot;,&quot;serif&quot;;  mso-fareast-font-family:&quot;Times New Roman&quot;;mso-fareast-language:PT-BR">&nbsp;<o:p></o:p></span></p>  <p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><span style="font-size:12.0pt;line-height:150%;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;  mso-fareast-font-family:&quot;Times New Roman&quot;;mso-fareast-language:PT-BR">Os verbos terminados em <b>-uir</b> e <b>-oer</b> ter&atilde;o as 2&ordf; e 3&ordf; pessoas do singular do presente do&nbsp;indicativo escritas com <b>-i-</b>:<b> <o:p></o:p></b></span></p>  <p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><b><span style="font-size:12.0pt;line-height:150%;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;  mso-fareast-font-family:&quot;Times New Roman&quot;;mso-fareast-language:PT-BR">&nbsp;</span></b></p>  <p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><b><span style="font-size:12.0pt;line-height:150%;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;  mso-fareast-font-family:&quot;Times New Roman&quot;;mso-fareast-language:PT-BR">- tu possuis<o:p></o:p></span></b></p>  <p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><b><span style="font-size:12.0pt;line-height:150%;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;  mso-fareast-font-family:&quot;Times New Roman&quot;;mso-fareast-language:PT-BR">- ele possui<o:p></o:p></span></b></p>  <p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><b><span 
style="font-size:12.0pt;line-height:150%;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;  mso-fareast-font-family:&quot;Times New Roman&quot;;mso-fareast-language:PT-BR">- tu constr&oacute;is<o:p></o:p></span></b></p>  

At first it seemed that wp_insert_post() would process it automatically. It does do some processing, but not enough.

This is how the content is being stored by the import script:

<p>&nbsp;<b style="line-height: 150%; text-align: center;"><span style="font-size:13.5pt;line-height:150%;
font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;mso-fareast-font-family:&quot;Times New Roman&quot;;
mso-fareast-language:PT-BR">UIR e OER</span></b></p>
<p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><span style="font-size:12.0pt;line-height:150%;font-family:&quot;Times New Roman&quot;,&quot;serif&quot;;
mso-fareast-font-family:&quot;Times New Roman&quot;;mso-fareast-language:PT-BR">&nbsp;<o:p></o:p></span></p>
<p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><span style="font-size:12.0pt;line-height:150%;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;
mso-fareast-font-family:&quot;Times New Roman&quot;;mso-fareast-language:PT-BR">Os verbos terminados em <b>-uir</b> e <b>-oer</b> ter&atilde;o as 2&ordf; e 3&ordf; pessoas do singular do presente do&nbsp;indicativo escritas com <b>-i-</b>:<b> <o:p></o:p></b></span></p>
<p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><b><span style="font-size:12.0pt;line-height:150%;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;
mso-fareast-font-family:&quot;Times New Roman&quot;;mso-fareast-language:PT-BR">&nbsp;</span></b></p>
<p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><b><span style="font-size:12.0pt;line-height:150%;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;
mso-fareast-font-family:&quot;Times New Roman&quot;;mso-fareast-language:PT-BR">- tu possuis<o:p></o:p></span></b></p>
<p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><b><span style="font-size:12.0pt;line-height:150%;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;
mso-fareast-font-family:&quot;Times New Roman&quot;;mso-fareast-language:PT-BR">- ele possui<o:p></o:p></span></b></p>
<p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><b><span style="font-size:12.0pt;line-height:150%;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;
mso-fareast-font-family:&quot;Times New Roman&quot;;mso-fareast-language:PT-BR">- tu constr&oacute;is<o:p></o:p></span></b></p>

My first idea was to write a function myself, based on preg_replace() and html_entity_decode(), but it seems to me there should be a much more elegant solution. Is there one?

Edit: To put it another way, does PHP, or WordPress itself, provide a way to process the content the way TinyMCE (the WordPress built-in editor) does? Naturally, I can't rely on TinyMCE itself because it's a JavaScript tool.

Answer 1:

In a recent project we needed to do the same thing. We used the following approaches:

  1. preg_replace for the simplest tasks.
  2. DOMDocument. This is an excellent PHP tool for parsing HTML.
  3. (non-PHP) The main import was done with Node. With a couple of necessary tweaks, the wp-cli node module is an excellent tool for manipulating WordPress environments, and we could then use cheeriojs for parsing and modifying HTML.
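
As a rough illustration of the DOMDocument approach in item 2, here is a minimal sketch, not the code from that project. The function names render_inline() and clean_word_html() are made up for this example, and the rules it applies (keep bold and italic, unwrap every other tag, turn empty paragraphs into &nbsp;, separate paragraphs with blank lines) are assumptions based on the target format shown in the question.

```php
<?php
// Hypothetical helpers for illustration; nothing here is part of WordPress.

/** Render a DOM node as inline markup, keeping only bold and italic tags. */
function render_inline(DOMNode $node): string
{
    // Tags worth keeping, mapped to the names used in the target format.
    static $keep = ['b' => 'strong', 'strong' => 'strong', 'i' => 'em', 'em' => 'em'];

    if ($node instanceof DOMText) {
        // Non-breaking spaces become ordinary spaces; whitespace runs collapse.
        $text = str_replace("\u{00A0}", ' ', $node->nodeValue);
        $text = preg_replace('/\s+/u', ' ', $text);
        // Entities were decoded by the parser, so re-escape &, < and >.
        return htmlspecialchars($text, ENT_NOQUOTES, 'UTF-8');
    }

    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= render_inline($child);
    }

    $tag = $keep[strtolower($node->nodeName)] ?? null;
    // Unknown wrappers (<span>, <o:p>, ...) and empty bold runs are unwrapped,
    // which also discards their style and class attributes.
    if ($tag === null || trim($inner) === '') {
        return $inner;
    }
    return "<$tag>$inner</$tag>";
}

/** Reduce Word-flavoured HTML to plain paragraphs separated by blank lines. */
function clean_word_html(string $html): string
{
    $doc = new DOMDocument();
    // Word tags such as <o:p> are not valid HTML, so collect warnings quietly.
    $previous = libxml_use_internal_errors(true);
    // The <meta> tag only tells the parser that the input is UTF-8.
    $doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $paragraphs = [];
    foreach ($body->childNodes as $child) {
        $text = trim(render_inline($child));
        if ($text !== '') {
            $paragraphs[] = $text;
        } elseif (strtolower($child->nodeName) === 'p') {
            // Assumption: an empty <p> was a deliberate blank line; keep it.
            $paragraphs[] = '&nbsp;';
        }
    }
    return implode("\n\n", $paragraphs);
}
```

The returned string can be passed as post_content to wp_insert_post(). WordPress normally turns the blank lines back into paragraphs on display through its wpautop() filter, so explicit <p> tags should not be needed in the stored content.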