I want to extract the plain text from HTML. Which one should I choose: PHP's strip_tags or simplehtmldom's plaintext extraction?
One advantage of simplehtmldom is that it handles invalid HTML. Is that reason enough on its own?
You may also want to remove backslashes with stripslashes().
Extracting text from HTML is tricky, so the best option is to use a library like Html2Text. It was built specifically for this purpose.
https://github.com/mtibben/html2text
Install using Composer (the package is published as html2text/html2text): composer require html2text/html2text
Basic usage:
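The snippet that belonged here did not survive, so the following is only a minimal sketch of how the library is typically called. It assumes the package was installed with Composer and that the autoloader lives at vendor/autoload.php relative to the script; the sample HTML string is made up for illustration.

```php
<?php
// Assumption: Composer's autoloader is at vendor/autoload.php next to this script.
require __DIR__ . '/vendor/autoload.php';

// Wrap the HTML in an Html2Text object, then ask for its text rendering.
$html = new \Html2Text\Html2Text('Hello, &quot;<b>world</b>&quot;');

// getText() returns the plain-text version as a string.
echo $html->getText(), PHP_EOL;
```

The library applies its own formatting rules to elements such as links, headings and bold text, so check the output against your own pages before relying on it.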
If you just want a plain text rendering of a page then strip_tags is faster and simpler. If you want to do any manipulation of the text during that process, however, simplehtmldom is going to serve you better in the long run.
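For the "faster and simpler" case, a minimal sketch using only built-in functions might look like this. The helper name html_to_plain is hypothetical, and the entity decoding and whitespace collapsing are extra steps assumed here for tidier output, not something strip_tags does by itself.

```php
<?php
// Hypothetical helper: plain-text rendering with built-in functions only.
function html_to_plain(string $html): string
{
    // Remove the tags; text between them (including entities) is kept.
    $text = strip_tags($html);
    // Turn entities such as &amp; back into the characters they stand for.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace left behind by the markup.
    return trim(preg_replace('/\s+/u', ' ', $text));
}

echo html_to_plain('<p>Hello, <b>world</b> &amp; friends</p>'), PHP_EOL;
// prints: Hello, world & friends
```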
You should probably use simplehtmldom, both for the reason you mentioned and because strip_tags can leave non-text content behind, such as the JavaScript or CSS contained in script/style blocks.
You would also be able to filter out text from elements that aren't displayed (inline style="display:none").
That said, if the HTML is simple enough, strip_tags may be faster and will accomplish the same task.
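To illustrate the difference, here is a rough sketch. Note that it uses PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than simplehtmldom, so it shows the general DOM-based approach and not that library's API. The helper name visible_text and the sample markup are hypothetical, and the display:none check only looks at inline style attributes, not at stylesheets.

```php
<?php
// Hypothetical helper built on PHP's DOM extension (not simplehtmldom).
function visible_text(string $html): string
{
    if (trim($html) === '') {
        return '';
    }
    $doc = new DOMDocument();
    // libxml tolerates invalid markup; collect its warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Note: non-ASCII input may need an explicit encoding hint before loading.
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // script and style blocks, plus elements hidden by an inline display:none.
    $query = '//script | //style'
           . ' | //*[contains(translate(@style, " ", ""), "display:none")]';
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }
    if ($doc->documentElement === null) {
        return '';
    }
    // Collapse whitespace in whatever text is left.
    return trim(preg_replace('/\s+/u', ' ', $doc->documentElement->textContent));
}

$sample = '<div><p>Visible</p><script>var x = 1;</script>'
        . '<style>p{color:red}</style>'
        . '<p style="display: none">Hidden</p></div>';

echo strip_tags($sample), PHP_EOL;   // still contains the script and style text
echo visible_text($sample), PHP_EOL; // prints: Visible
```

The DOM route costs a full parse of the document, which is the trade-off against strip_tags mentioned above.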
strip_tags is sufficient for that.