I was assigned to import a large amount of content from the database of a proprietary CMS into a new WordPress installation. I wrote a PHP script that retrieves the entries and inserts them with the wp_insert_post() function, but I am now stuck on a problem.
I want to "filter" the input string (the source content) so that it matches the format WordPress uses natively when content is copy-pasted into the built-in editor. For instance, this is what it should look like:
<strong>UIR e OER</strong>
Os verbos terminados em <strong>-uir</strong> e <strong>-oer</strong> terão as 2ª e 3ª pessoas do singular do presente do indicativo escritas com <strong>-i-</strong>:
<strong> </strong>
<strong>– tu possuis</strong>
<strong>– ele possui</strong>
<strong>– tu constróis</strong>
...
Now, this is how the original content is retrieved from the source database:
<p> <b style="line-height: 150%; text-align: center;"><span style="font-size:13.5pt;line-height:150%; font-family:"Arial","sans-serif";mso-fareast-font-family:"Times New Roman"; mso-fareast-language:PT-BR">UIR e OER</span></b></p> <p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><span style="font-size:12.0pt;line-height:150%;font-family:"Times New Roman","serif"; mso-fareast-font-family:"Times New Roman";mso-fareast-language:PT-BR"> <o:p></o:p></span></p> <p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><span style="font-size:12.0pt;line-height:150%;font-family:"Arial","sans-serif"; mso-fareast-font-family:"Times New Roman";mso-fareast-language:PT-BR">Os verbos terminados em <b>-uir</b> e <b>-oer</b> terão as 2ª e 3ª pessoas do singular do presente do indicativo escritas com <b>-i-</b>:<b> <o:p></o:p></b></span></p> <p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><b><span style="font-size:12.0pt;line-height:150%;font-family:"Arial","sans-serif"; mso-fareast-font-family:"Times New Roman";mso-fareast-language:PT-BR"> </span></b></p> <p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><b><span style="font-size:12.0pt;line-height:150%;font-family:"Arial","sans-serif"; mso-fareast-font-family:"Times New Roman";mso-fareast-language:PT-BR">- tu possuis<o:p></o:p></span></b></p> <p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><b><span style="font-size:12.0pt;line-height:150%;font-family:"Arial","sans-serif"; mso-fareast-font-family:"Times New Roman";mso-fareast-language:PT-BR">- ele possui<o:p></o:p></span></b></p> <p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><b><span style="font-size:12.0pt;line-height:150%;font-family:"Arial","sans-serif"; mso-fareast-font-family:"Times New Roman";mso-fareast-language:PT-BR">- tu constróis<o:p></o:p></span></b></p>
At first it seemed that wp_insert_post() would process it automatically. It does do some processing, but not enough.
This is how the content is being stored by the import script:
<p> <b style="line-height: 150%; text-align: center;"><span style="font-size:13.5pt;line-height:150%;
font-family:"Arial","sans-serif";mso-fareast-font-family:"Times New Roman";
mso-fareast-language:PT-BR">UIR e OER</span></b></p>
<p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><span style="font-size:12.0pt;line-height:150%;font-family:"Times New Roman","serif";
mso-fareast-font-family:"Times New Roman";mso-fareast-language:PT-BR"> <o:p></o:p></span></p>
<p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><span style="font-size:12.0pt;line-height:150%;font-family:"Arial","sans-serif";
mso-fareast-font-family:"Times New Roman";mso-fareast-language:PT-BR">Os verbos terminados em <b>-uir</b> e <b>-oer</b> terão as 2ª e 3ª pessoas do singular do presente do indicativo escritas com <b>-i-</b>:<b> <o:p></o:p></b></span></p>
<p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><b><span style="font-size:12.0pt;line-height:150%;font-family:"Arial","sans-serif";
mso-fareast-font-family:"Times New Roman";mso-fareast-language:PT-BR"> </span></b></p>
<p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><b><span style="font-size:12.0pt;line-height:150%;font-family:"Arial","sans-serif";
mso-fareast-font-family:"Times New Roman";mso-fareast-language:PT-BR">- tu possuis<o:p></o:p></span></b></p>
<p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><b><span style="font-size:12.0pt;line-height:150%;font-family:"Arial","sans-serif";
mso-fareast-font-family:"Times New Roman";mso-fareast-language:PT-BR">- ele possui<o:p></o:p></span></b></p>
<p class="MsoNormal" style="mso-margin-bottom-alt:auto;line-height:150%"><b><span style="font-size:12.0pt;line-height:150%;font-family:"Arial","sans-serif";
mso-fareast-font-family:"Times New Roman";mso-fareast-language:PT-BR">- tu constróis<o:p></o:p></span></b></p>
My first idea was to implement a function myself, based on preg_replace() and html_entity_decode(), but it seems to me that there should be a much more elegant solution. Is there one?
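To make that first idea concrete, here is a rough sketch of the kind of hand-rolled function I mean. The name clean_word_html() is made up for illustration, and the decision to keep only bold tags is an assumption based on the sample above. It is not a tested solution for the full data set, and it does not reproduce WordPress's typographic changes such as the dash conversion visible in the desired output.

```php
<?php
// Hypothetical helper (name invented for illustration): reduces
// Word-exported HTML to plain text with <strong> as the only markup.
function clean_word_html(string $html): string
{
    // Drop Office namespace elements such as <o:p></o:p>.
    $html = preg_replace('~</?o:[^>]*>~i', '', $html);

    // Turn each closing </p> into a line break before tags are stripped.
    $html = preg_replace('~</p\s*>~i', "\n", $html);

    // Keep only bold tags; everything else (p, span, ...) is removed.
    $html = strip_tags($html, '<b><strong>');

    // Remove attributes from the surviving tags and normalise <b> to <strong>.
    $html = preg_replace('~<(?:b|strong)\b[^>]*>~i', '<strong>', $html);
    $html = preg_replace('~</(?:b|strong)\s*>~i', '</strong>', $html);

    // Decode entities such as &amp; or &ordf; into UTF-8 characters.
    $html = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Trim ASCII whitespace around each line and discard empty lines.
    $lines = array_map('trim', explode("\n", $html));
    $lines = array_filter($lines, 'strlen');

    return implode("\n", $lines);
}

// Example call on a shortened version of the source markup.
$sample = '<p class="MsoNormal"><b><span style="font-size:12.0pt">'
        . '- tu possuis<o:p></o:p></span></b></p>';
echo clean_word_html($sample), "\n";
```

The ordering matters here: the paragraph ends are converted to newlines before strip_tags() runs, otherwise the paragraph boundaries would be lost. This is exactly the sort of fragile, order-dependent code I would like to avoid if something built in already does the job.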
Edit: To put it another way, does PHP, or WordPress itself, provide a way to process the content the way TinyMCE (the built-in WordPress editor) does? Naturally, I can't rely on TinyMCE itself because it's a JavaScript tool.