What built-in PHP functions are useful for web scraping? What are some good resources (web or print) for getting up to speed on web scraping with PHP?
ScraperWiki is a pretty interesting project. It helps you build scrapers online in Python, Ruby, or PHP; I was able to get a simple attempt up in a few minutes.
Here is another one: a simple PHP scraper without regex.
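Scraping generally encompasses 3 steps: (1) send a GET or POST request to a URL, (2) receive the HTML returned in the response, and (3) parse out of that HTML the text you'd like to scrape.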
To accomplish steps 1 and 2, below is a simple PHP class that uses cURL to fetch web pages using either GET or POST. Once you have the HTML back, you use regular expressions to accomplish step 3, parsing out the text you'd like to scrape.
For regular expressions, my favorite tutorial site is the following: Regular Expressions Tutorial
My favorite program for working with regexes is RegexBuddy. I would advise you to try the demo of that product even if you have no intention of buying it. It is an invaluable tool and will even generate code for the regexes you write, in your language of choice (including PHP).
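As a rough illustration of step 3, here is a minimal sketch that pulls link targets and link text out of HTML with `preg_match_all`. The sample HTML and the pattern are made up for this example; a pattern this simple assumes `href` is the first attribute and is double-quoted, so real pages will usually need something more forgiving.

```php
<?php
// Made-up HTML standing in for the response from steps 1 and 2.
$html = '<ul><li><a href="/a.html">First</a></li>'
      . '<li><a href="/b.html">Second</a></li></ul>';

// Capture each link's href (group 1) and inner HTML (group 2).
// /i ignores case; /s lets "." match newlines inside the link text.
preg_match_all(
    '/<a\s+href="([^"]*)"[^>]*>(.*?)<\/a>/is',
    $html,
    $matches,
    PREG_SET_ORDER
);

$links = [];
foreach ($matches as $m) {
    // strip_tags removes any nested markup from the link text.
    $links[] = ['href' => $m[1], 'text' => trim(strip_tags($m[2]))];
}

print_r($links);
```

`PREG_SET_ORDER` groups the results per match, which tends to be easier to loop over than the default per-group layout.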
Usage:
PHP Class:
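What follows is a minimal stand-in sketch for steps 1 and 2, not the class this answer refers to. It assumes the cURL extension is enabled; the function name `fetch_page`, its parameters, and the 30-second timeout are all hypothetical choices made for illustration.

```php
<?php
// Hypothetical helper: fetch a page with cURL using GET or POST
// and return the response body as a string.
function fetch_page(string $url, string $method = 'GET', array $fields = []): string
{
    $method = strtoupper($method);
    if ($method !== 'GET' && $method !== 'POST') {
        throw new InvalidArgumentException("Unsupported method: $method");
    }

    $ch = curl_init();
    if ($method === 'POST') {
        // Send the fields as a URL-encoded form body.
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
    } elseif ($fields) {
        // For GET, append the fields to the URL as a query string.
        $separator = strpos($url, '?') === false ? '?' : '&';
        $url .= $separator . http_build_query($fields);
    }

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // assumed timeout, in seconds

    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $html;
}
```

A call might look like `$html = fetch_page('https://example.com/search', 'POST', ['q' => 'php']);`, where the URL and field name are placeholders.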
Here's an OK tutorial (link removed, see below) on web scraping using cURL and file_get_contents. Be sure to read the next few parts as well. (Direct hyperlink removed due to malware warnings.)
http://www.oooff.com/php-scripts/basic-php-scraped-data-parsing/basic-php-data-parsing.php
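For simple GET requests, `file_get_contents` can stand in for cURL. The sketch below is an illustration only and is not taken from the tutorial; the helper name `simple_get`, the timeout, and the user-agent string are assumptions. Fetching remote URLs this way requires `allow_url_fopen` to be enabled in the PHP configuration.

```php
<?php
// Hypothetical helper: GET a URL (or read a local path) with file_get_contents.
function simple_get(string $url, int $timeout = 30): string
{
    // The 'http' options apply to http:// and https:// URLs only.
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeout,
            'user_agent' => 'ExampleScraper/1.0', // placeholder value
        ],
    ]);

    // @ silences the warning on failure; the false return is checked instead.
    $html = @file_get_contents($url, false, $context);
    if ($html === false) {
        throw new RuntimeException("Could not fetch $url");
    }
    return $html;
}
```

Compared with cURL this gives you less control over cookies, redirects, and error details, so it fits best when a single plain request is enough.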
I'd like to recommend this class I recently came across. Simple HTML DOM Parser
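If you would rather stay with what ships in PHP, the built-in DOM extension supports a similar parse-and-query workflow. To be clear, the sketch below uses `DOMDocument` and `DOMXPath`, not Simple HTML DOM Parser's own API; the sample HTML and the `story` class name are made up for illustration.

```php
<?php
// Made-up HTML standing in for a fetched page.
$html = '<html><body><h1>News</h1>'
      . '<a class="story" href="/one">One</a>'
      . '<a href="/two">Two</a>'
      . '</body></html>';

// Real-world HTML is rarely valid, so collect parser warnings
// internally instead of emitting them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select every <a> whose class attribute is exactly "story".
$xpath = new DOMXPath($doc);
$stories = [];
foreach ($xpath->query('//a[@class="story"]') as $a) {
    $stories[] = [
        'href' => $a->getAttribute('href'),
        'text' => trim($a->textContent),
    ];
}

print_r($stories);
```

Querying the parsed tree tends to survive small markup changes (extra attributes, different whitespace) that would break a regular expression.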
Scraping can be pretty complex, depending on what you want to do. Have a read of this tutorial series on The Basics Of Writing A Scraper In PHP and see if you can get to grips with it.
You can use similar methods to automate form sign-ups, logins, even fake clicking on ads! The main limitation of cURL, though, is that it doesn't execute JavaScript, so if you are trying to scrape a site that uses AJAX for pagination, for example, it can become a little tricky... but again, there are ways around that!
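A login, for instance, usually means POSTing the form fields and then reusing the session cookie on later requests. The sketch below is an assumption-laden illustration, not code from the tutorial series: the function name `login_and_fetch`, its parameters, and the idea that the site accepts a plain form POST (no CSRF token, no JavaScript) are all hypothetical.

```php
<?php
// Hypothetical helper: POST a login form, then GET a page using the same
// cURL handle so the session cookie set at login is sent along.
function login_and_fetch(string $loginUrl, array $credentials, string $targetUrl): string
{
    if ($credentials === []) {
        throw new InvalidArgumentException('No form fields given');
    }

    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return bodies instead of printing them
        CURLOPT_FOLLOWLOCATION => true, // logins often answer with a redirect
        CURLOPT_COOKIEFILE     => '',   // empty string turns on in-memory cookie handling
    ]);

    // Request 1: submit the login form as a URL-encoded POST body.
    curl_setopt_array($ch, [
        CURLOPT_URL        => $loginUrl,
        CURLOPT_POST       => true,
        CURLOPT_POSTFIELDS => http_build_query($credentials),
    ]);
    if (curl_exec($ch) === false) {
        throw new RuntimeException('Login request failed: ' . curl_error($ch));
    }

    // Request 2: switch back to GET and fetch the protected page.
    curl_setopt_array($ch, [
        CURLOPT_URL     => $targetUrl,
        CURLOPT_HTTPGET => true,
    ]);
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('Page request failed: ' . curl_error($ch));
    }
    return $html;
}
```

The field names passed in `$credentials` have to match the `name` attributes of the site's login form, which you would read from the page source first.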