I use TinyMCE to allow minimal formatting of text within my site. From the HTML that's produced, I'd like to convert it to plain text for e-mail. I've been using a class called html2text, but it's really lacking in UTF-8 support, among other things. I do, however, like that it maps certain HTML tags to plain text formatting — like putting underscores around text that previously had <i> tags in the HTML.
Does anyone use a similar approach to converting HTML to plain text in PHP? And if so: Do you recommend any third-party classes that I can use? Or how do you best tackle this issue?
If you want to decode the HTML special characters rather than just remove them, and also strip the markup down to plain text, this combination of functions worked for me.
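A minimal sketch of that combination, reconstructed from the description in this answer rather than copied from it. The function name htmlToPlainText and the explicit ENT_HTML5 / 'UTF-8' arguments on the second html_entity_decode call are my own assumptions:

```php
<?php
// Hypothetical helper: decode entities in several passes, then strip tags.
function htmlToPlainText(string $html): string
{
    // Pass 1: XML-style entities and numeric references, quotes included (e.g. &#39;).
    $text = html_entity_decode($html, ENT_QUOTES | ENT_XML1, 'UTF-8');

    // Pass 2: the basic special characters (&amp;, &lt;, &gt;, &quot;).
    $text = htmlspecialchars_decode($text);

    // Pass 3: the remaining named entities (e.g. &eacute;), assuming UTF-8 output.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Finally, drop whatever markup is left.
    return strip_tags($text);
}

echo htmlToPlainText('<p>Tom &amp; Jerry&#39;s caf&eacute;</p>'), "\n";
```

Be aware that decoding before strip_tags means text that was deliberately escaped, such as &lt;b&gt;, turns back into a tag and is stripped too. If that matters for your content, call strip_tags first and decode afterwards.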
html_entity_decode with ENT_QUOTES | ENT_XML1 converts things like &#39; (an encoded apostrophe), htmlspecialchars_decode converts things like &amp;, a plain html_entity_decode call converts things like &lt;, and strip_tags removes any HTML tags left over.
If you don't want to strip the tags completely and only want to keep the content inside them, you can load the HTML into a DOMDocument and extract the textContent of the root node. One advantage of this approach is that it does not require any external packages.
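A short sketch of the DOMDocument idea, assuming the input is a UTF-8 HTML fragment. The function name htmlToTextContent is hypothetical, and the XML encoding prefix is a common workaround for making loadHTML treat the input as UTF-8, not something taken from the answer itself:

```php
<?php
// Hypothetical helper: parse the HTML and return the text of the root node.
function htmlToTextContent(string $html): string
{
    $doc = new DOMDocument();

    // The prefix hints the encoding to libxml; the flags silence parser
    // warnings about fragments and unknown tags.
    $doc->loadHTML(
        '<?xml encoding="UTF-8">' . $html,
        LIBXML_NOERROR | LIBXML_NOWARNING
    );

    // documentElement is the <html> node that libxml wraps around fragments.
    $root = $doc->documentElement;

    return $root === null ? '' : trim($root->textContent);
}

echo htmlToTextContent('<p>Hello <i>world</i></p>'), "\n";
```

Note that textContent also returns the contents of script and style elements, so you may want to remove those nodes first if your HTML can contain them.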
Here is another solution:
For other variations of sanitization functions, see:
https://github.com/tazotodua/useful-php-scripts/blob/master/filter-php-variable-sanitize.php
You can use lynx with -stdin and -dump options to achieve that:
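One way this could look from PHP, as a sketch only: it assumes the lynx binary is installed and on the PATH, and the function name htmlToTextViaLynx is made up for illustration. The HTML is piped to lynx on stdin and the rendered text is read back from stdout.

```php
<?php
// Hypothetical helper: render HTML to text by piping it through lynx.
function htmlToTextViaLynx(string $html): string
{
    $descriptors = [
        0 => ['pipe', 'r'], // stdin: lynx reads the HTML from here
        1 => ['pipe', 'w'], // stdout: lynx writes the rendered text here
        2 => ['pipe', 'w'], // stderr: captured for error reporting
    ];

    $process = proc_open('lynx -stdin -dump', $descriptors, $pipes);
    if (!is_resource($process)) {
        throw new RuntimeException('Could not start lynx');
    }

    // Send the document and close stdin so lynx sees end of input.
    fwrite($pipes[0], $html);
    fclose($pipes[0]);

    $text = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    $errors = stream_get_contents($pipes[2]);
    fclose($pipes[2]);

    // A non-zero exit code covers both lynx errors and a missing binary.
    if (proc_close($process) !== 0) {
        throw new RuntimeException('lynx failed: ' . trim((string) $errors));
    }

    return (string) $text;
}
```

This spawns an external process per conversion, so it is probably better suited to occasional e-mails than to bulk processing.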
Converting from HTML to text using a DOMDocument is a viable solution. Consider HTML2Text, which requires PHP 5.
Regarding UTF-8, the write-up on the "howto" page addresses the issue: the author provides several approaches to solving it and states that version 2 of HTML2Text (using DOMDocument) has UTF-8 support.
Note the restrictions for commercial use.
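Independently of HTML2Text, the DOMDocument idea can also be sketched by hand. The following is not HTML2Text's code or API. It is a rough illustration with hypothetical function names (renderNode, htmlToFormattedText) of how walking the DOM lets you map tags to plain-text conventions, such as the underscores around <i> text that the question asks for. The tag mapping is an assumption and deliberately small:

```php
<?php
// Hypothetical helper: turn one DOM node (and its children) into plain text.
function renderNode(DOMNode $node): string
{
    if ($node instanceof DOMText) {
        // Collapse runs of whitespace, roughly as a browser would.
        return preg_replace('/\s+/u', ' ', (string) $node->nodeValue) ?? '';
    }
    if (!($node instanceof DOMElement)) {
        return ''; // skip comments, processing instructions, etc.
    }

    $tag = strtolower($node->nodeName);
    if ($tag === 'script' || $tag === 'style') {
        return ''; // never show code or CSS as text
    }
    if ($tag === 'br') {
        return "\n";
    }

    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= renderNode($child);
    }

    switch ($tag) {
        case 'i':
        case 'em':
            return '_' . $inner . '_';
        case 'b':
        case 'strong':
            return '*' . $inner . '*';
        case 'p':
        case 'div':
            return trim($inner) . "\n\n";
        default:
            return $inner;
    }
}

// Hypothetical entry point: parse a UTF-8 fragment and render it.
function htmlToFormattedText(string $html): string
{
    $doc = new DOMDocument();
    $doc->loadHTML(
        '<?xml encoding="UTF-8">' . $html,
        LIBXML_NOERROR | LIBXML_NOWARNING
    );
    $root = $doc->documentElement;

    return $root === null ? '' : trim(renderNode($root));
}

echo htmlToFormattedText('<p>This is <i>important</i> and <b>bold</b>.</p><p>Second</p>'), "\n";
```

Lists, links and headings would need their own cases, which is where a maintained library starts to pay off.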
Markdownify converts HTML to Markdown, a plain-text formatting system used on this very site.