Parsing HTML in CakePHP

Posted 2019-08-08 20:18

I have started building a web crawler in CakePHP 2.2. The pages the script crawls are HTML pages, and I need to parse them to extract my values.

I have tried a few different solutions and looked at some open source projects as well, but I am not sure what the best way to do this is.

DomDocument::loadHTML() - Looks like this is the solution, but I am not 100% sure.
Regular expressions - A bit hard to maintain.
Simple HTML DOM - http://electrokami.com/coding/simple-html-dom-baked-cakephp-component (made for Cake 1.3; I don't like the code itself, and it has serious memory leaks).

I need your help to figure out which method I should use.

Tags: html parsing web-crawler php-5.3 cakephp-2.2

1 answer

Emotional °昔

Post #2 · 2019-08-08 20:34

DomDocument is your best choice. There are some decent examples in the php.net documentation for this extension. If you can use another language such as Ruby, I have had very good experiences with hpricot, a jQuery-like library for parsing HTML.

This question is related to "Robust and Mature HTML Parser for PHP".
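
The DomDocument approach recommended above could look roughly like the sketch below. It is only an illustration under stated assumptions: the sample markup, the XPath expressions, and the helper name extractValues() are hypothetical and not taken from the question. It uses array() syntax so it should also run on PHP 5.3.

```php
<?php
// Hypothetical sample markup standing in for a crawled page.
$html = '<html><body>'
      . '<h1 class="title">Example page</h1>'
      . '<a href="/first">First</a>'
      . '<a href="/second">Second</a>'
      . '</body></html>';

/**
 * Extract the first matching heading and all link targets from an HTML string.
 * extractValues() is an illustrative helper, not part of PHP or CakePHP.
 */
function extractValues($html)
{
    // Collect libxml parse warnings internally instead of emitting them;
    // crawled HTML is often not well-formed.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Gather the href attribute of every anchor that has one.
    $links = array();
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = $a->getAttribute('href');
    }

    // Take the text of the first h1 with class "title", if present.
    $titleNode = $xpath->query('//h1[@class="title"]')->item(0);
    $title = $titleNode ? trim($titleNode->textContent) : null;

    return array('title' => $title, 'links' => $links);
}

$result = extractValues($html);
```

In a CakePHP 2.x application, a function like this could live in a component or a plain library class. Creating a fresh DOMDocument per page and letting it go out of scope afterwards is one way to keep memory use bounded during a long crawl.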
