php: Get plain text from html - simplehtmldom or p

Posted 2019-02-19 23:43

I want to extract the plain text from HTML. Which should I choose: PHP's strip_tags or simplehtmldom's plaintext extraction?

One advantage of simplehtmldom is its support for invalid HTML. Is that sufficient in itself?

5 answers
等我变得足够好
#2 · 2019-02-20 00:12

You may also want to remove slashes with stripslashes().
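As a small illustration of that suggestion: stripslashes() only helps when the text was previously escaped with backslashes, and on unescaped text it would remove legitimate backslashes. The sample string below is made up, and addslashes() merely stands in for whatever added the escaping.

```php
<?php
// Illustrative input: addslashes() stands in for whatever added the backslashes.
$escaped = addslashes('O\'Reilly said "hi"');

echo $escaped, PHP_EOL;               // O\'Reilly said \"hi\"
echo stripslashes($escaped), PHP_EOL; // O'Reilly said "hi"
```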

叛逆
#3 · 2019-02-20 00:15

Extracting text from HTML is tricky, so the best option is to use a library like Html2Text. It was built specifically for this purpose.

https://github.com/mtibben/html2text

Install using composer:

composer require html2text/html2text

Basic usage:

$html = new \Html2Text\Html2Text('Hello, &quot;<b>world</b>&quot;');

echo $html->getText();  // Hello, "WORLD"
女痞
#4 · 2019-02-20 00:21

If you just want a plain text rendering of a page then strip_tags is faster and simpler. If you want to do any manipulation of the text during that process, however, simplehtmldom is going to serve you better in the long run.
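A minimal sketch of the strip_tags route, assuming well-formed input with no script or style blocks. The function name quick_plain_text and the sample markup are illustrative and not from the original post. It also decodes entities and collapses whitespace, which strip_tags does not do on its own.

```php
<?php
// Minimal sketch; the function name and sample input are illustrative.
function quick_plain_text(string $html): string
{
    // Add a space before each tag so neighbouring blocks do not run together.
    $text = strip_tags(str_replace('<', ' <', $html));
    // Turn entities such as &amp; and &quot; back into characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse the runs of whitespace left behind by the removed tags.
    return trim(preg_replace('/\s+/u', ' ', $text));
}

echo quick_plain_text('<h1>Title</h1><p>Fish &amp; chips, <b>fresh</b> daily</p>'), PHP_EOL;
// Title Fish & chips, fresh daily
```

One trade-off of the spacing step: an inline tag directly followed by punctuation, such as `<b>fresh</b>,`, ends up with a space before the comma.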

贼婆χ
#5 · 2019-02-20 00:23

You should probably use simplehtmldom, both for the reason you mentioned and because strip_tags may leave non-text content in the output, such as JavaScript or CSS contained within script/style blocks.

You would also be able to filter out text from elements that aren't displayed (inline style=display:none).

That said, if the HTML is simple enough, strip_tags may be faster and will accomplish the same task.
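The difference can be sketched as follows. This is not simplehtmldom: PHP's bundled DOM extension stands in for a DOM parser so the example needs no extra library. The function name visible_text and the sample markup are assumptions for illustration. It handles only inline, lowercase display:none and knows nothing about stylesheets or classes.

```php
<?php
// Sketch only: PHP's bundled DOM extension stands in for simplehtmldom here.
function visible_text(string $html): string
{
    $doc = new DOMDocument();
    // Collect warnings about invalid markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The <meta> prefix is a common workaround so libxml reads the input as UTF-8.
    $doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Drop script/style blocks and elements hidden with an inline display:none.
    $hidden = '//script | //style | //*[contains(translate(@style, " ", ""), "display:none")]';
    foreach (iterator_to_array($xpath->query($hidden)) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Join the remaining text nodes with spaces, then collapse whitespace.
    $parts = [];
    foreach ($xpath->query('//body//text()') as $textNode) {
        $parts[] = $textNode->nodeValue;
    }
    return trim(preg_replace('/\s+/u', ' ', implode(' ', $parts)));
}

// Hypothetical sample; note the unclosed final <p>.
$html = '<p>Hello <b>world</b></p>'
      . '<script>alert(1);</script>'
      . '<style>p { color: red; }</style>'
      . '<div style="display: none">hidden</div>'
      . '<p>Bye';

echo strip_tags($html), PHP_EOL;   // script, style and hidden text are still in the output
echo visible_text($html), PHP_EOL; // Hello world Bye
```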

Fickle 薄情
#6 · 2019-02-20 00:24

strip_tags is sufficient for that.
