php: Get plain text from html - simplehtmldom or p

Posted 2019-02-19 23:43

I want to extract the plain text from HTML. Which should I choose: PHP's strip_tags, or simplehtmldom's plaintext extraction?

One advantage of simplehtmldom is that it copes with invalid HTML. Is that sufficient reason in itself?

Tags: php extract web-scraping text-extraction simple-html-dom

5 answers

等我变得足够好

#2 · 2019-02-20 00:12

You may also want to remove backslashes with stripslashes().


叛逆

#3 · 2019-02-20 00:15

Extracting text from HTML is tricky, so the best option is to use a library like Html2Text. It was built specifically for this purpose.

https://github.com/mtibben/html2text

Install using composer:

composer require html2text/html2text

Basic usage:

$html = new \Html2Text\Html2Text('Hello, &quot;<b>world</b>&quot;');

echo $html->getText();  // Hello, "WORLD"


女痞

#4 · 2019-02-20 00:21

If you just want a plain text rendering of a page then strip_tags is faster and simpler. If you want to do any manipulation of the text during that process, however, simplehtmldom is going to serve you better in the long run.
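As a minimal sketch of the strip_tags route (the input string here is made up for illustration): strip_tags leaves HTML entities undecoded, so it is usually paired with html_entity_decode. It also inserts no whitespace where block elements end, so adjacent blocks run together.

```php
<?php
// Hypothetical input; any HTML string works here.
$html = '<h1>Title</h1><p>Hello, &quot;<b>world</b>&quot;</p>';

// strip_tags() removes the tags but leaves entities such as &quot; untouched,
// so decode them afterwards to get readable text.
$text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

// The heading and the paragraph are joined with no separator.
echo $text, PHP_EOL; // TitleHello, "world"
```

If the missing separators matter, one option is to replace closing block tags with a newline before calling strip_tags.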


贼婆χ

#5 · 2019-02-20 00:23

You should probably use simplehtmldom, both for the reason you mentioned and because strip_tags removes only the tags themselves, so the JavaScript or CSS inside script/style blocks is left in the output as if it were text.

You would also be able to filter out text from elements that aren't displayed (for example, an inline style="display:none").

That said, if the HTML is simple enough, strip_tags will be faster and will accomplish the same task.
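To illustrate the script/style issue, here is a rough sketch. It swaps in PHP's built-in DOMDocument and DOMXPath rather than simplehtmldom so that it runs without extra dependencies (the dom extension is assumed to be enabled). The sample markup is invented, and the hidden-element check only looks at inline style attributes, not stylesheets.

```php
<?php
// Hypothetical input with a style block, a script block and a hidden element.
$html = '<html><head><style>p { color: red; }</style></head>'
      . '<body><p>Hello</p><script>alert("x");</script>'
      . '<p style="display: none">hidden</p><p>world</p></body></html>';

// strip_tags() keeps the contents of <script> and <style> as if they were text.
$naive = strip_tags($html);

// Parse with DOMDocument; internal errors are collected instead of raised,
// which lets libxml tolerate invalid markup quietly.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Select script/style elements plus anything hidden via an inline style.
// translate() strips spaces so "display: none" and "display:none" both match.
$xpath = new DOMXPath($dom);
$query = '//script | //style'
       . ' | //*[contains(translate(@style, " ", ""), "display:none")]';

// iterator_to_array() takes a snapshot so removals do not disturb iteration.
foreach (iterator_to_array($xpath->query($query)) as $node) {
    $node->parentNode->removeChild($node);
}

// Read the remaining text from the root element.
$clean = trim($dom->documentElement->textContent);

echo $naive, PHP_EOL; // still contains the CSS rule and the alert() call
echo $clean, PHP_EOL; // Helloworld
```

simplehtmldom exposes comparable selection and plaintext features, so the same filtering idea should carry over to it.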


Fickle 薄情

#6 · 2019-02-20 00:24

strip_tags is sufficient for that.
