问题:

I am looking at getting the plain text from html. Which one should I choose, php strip_tags or simplehtmldom plaintext extraction?

One pro for simplehtmldom is support of invalid html, is that sufficient in itself?

回答1:

You should probably use smiplehtmldom for the reason you mentioned and that strip_tags may also leave you non-text elements like javascript or css contained within script/style blocks

You would also be able to filter text from elements that aren't displayed (inline style=display:none)

That said, if the html is simple enough, then strip_tags may be faster and will accomplish the same task

回答2:

strip_tags is sufficient for that.

回答3:

Extracting text from HTML is tricky, so the best option is to use a library like Html2Text. It was built specifically for this purpose.

https://github.com/mtibben/html2text

Install using composer:

composer require html2text/html2text

Basic usage:

$html = new \Html2Text\Html2Text('Hello, &quot;<b>world</b>&quot;');

echo $html->getText();  // Hello, "WORLD"

回答4:

If you just want a plain text rendering of a page then strip_tags is faster and simpler. If you want to do any manipulation of the text during that process, however, simplehtmldom is going to serve you better in the long run.