How to implement a web scraper in PHP? [closed]

Posted 2019-01-01 14:45

What built-in PHP functions are useful for web scraping? What are some good resources (web or print) for getting up to speed on web scraping with PHP?

15 answers
萌妹纸的霸气范
#2 · 2019-01-01 14:56

ScraperWiki is a pretty interesting project. It helps you build scrapers online in Python, Ruby or PHP. I was able to get a simple attempt up and running in a few minutes.

余欢
#3 · 2019-01-01 14:56

Here is another one: a simple PHP scraper without regex.
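
As a rough illustration of scraping without regex, here is a minimal sketch that uses PHP's bundled DOM extension (DOMDocument and DOMXPath). It is not the scraper linked above; the helper name extractLinks and the sample markup are made up for this example.

```php
<?php
// Hypothetical helper: collect the href and text of every link in an
// HTML string using the DOM extension instead of regular expressions.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; keep parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    // Select every <a> element that carries an href attribute.
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = [
            'href' => $a->getAttribute('href'),
            'text' => trim($a->textContent),
        ];
    }
    return $links;
}

// Sample markup invented for this example.
$html = '<html><body><p><a href="/first">First page</a></p>'
      . '<p><a href="/second"> Second <b>page</b> </a></p></body></html>';

$links = extractLinks($html);
foreach ($links as $link) {
    echo $link['href'], ' => ', $link['text'], PHP_EOL;
}
```

The XPath expression can be swapped for any other selector, for example '//table//td', without touching the rest of the function.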

爱死公子算了
#4 · 2019-01-01 15:01

Scraping generally encompasses 3 steps:

  • First, you GET or POST your request to a specified URL.
  • Next, you receive the HTML that is returned as the response.
  • Finally, you parse out of that HTML the text you'd like to scrape.

To accomplish steps 1 and 2, below is a simple PHP class that uses cURL to fetch web pages with either GET or POST. Once you have the HTML back, you use regular expressions to accomplish step 3, parsing out the text you'd like to scrape.

For regular expressions, my favorite tutorial site is the following: Regular Expressions Tutorial

My favorite program for working with regexes is RegexBuddy. I would advise you to try the demo of that product even if you have no intention of buying it. It is an invaluable tool and will even generate code for the regexes you write in your language of choice (including PHP).

Usage:

$curl = new Curl();
$html = $curl->get("http://www.google.com");

// now, do your regex work against $html
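
For step 3, the regex work might look something like the sketch below. The HTML string stands in for whatever the GET request returns, and the patterns and variable names are assumptions chosen for illustration, not part of the class.

```php
<?php
// Stand-in for the HTML a GET request would return; invented for this sketch.
$html = '<ul><li class="item">Apples</li><li class="item">Pears</li></ul>'
      . '<title>Fruit list</title>';

// Capture the text inside each <li class="item"> element.
// The "s" flag lets "." match newlines, "i" ignores letter case, and
// ".*?" is non-greedy so each match stops at the first closing tag.
preg_match_all('#<li class="item">(.*?)</li>#si', $html, $matches);
$items = array_map('trim', $matches[1]);

// A single value: preg_match() returns 1 when the pattern is found.
$title = null;
if (preg_match('#<title>(.*?)</title>#si', $html, $m) === 1) {
    $title = html_entity_decode(trim($m[1]));
}

print_r($items);
echo $title, PHP_EOL;
```

Taking $matches[1] mirrors what the class's getAll() method returns: the contents of the first capture group for every match.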

PHP Class:



<?php

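class Curl
{
    // Path of the file used to store and resend cookies between requests.
    public $cookieJar = "";

    // cURL handle for the most recent request; created in get() / postForm().
    public $curl = null;

    public function __construct($cookieJarFile = 'cookies.txt') {
        $this->cookieJar = $cookieJarFile;
    }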

    function setup()
    {


        $header = array();
        $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
        $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] =  "Cache-Control: max-age=0";
        $header[] =  "Connection: keep-alive";
        $header[] = "Keep-Alive: 300";
        $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $header[] = "Accept-Language: en-us,en;q=0.5";
        $header[] = "Pragma: "; // browsers keep this blank.


        curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
        curl_setopt($this->curl, CURLOPT_HTTPHEADER, $header);
        curl_setopt($this->curl,CURLOPT_COOKIEJAR, $this->cookieJar); 
        curl_setopt($this->curl,CURLOPT_COOKIEFILE, $this->cookieJar);
        curl_setopt($this->curl,CURLOPT_AUTOREFERER, true);
        curl_setopt($this->curl,CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($this->curl,CURLOPT_RETURNTRANSFER, true);  
    }


    function get($url)
    { 
        $this->curl = curl_init($url);
        $this->setup();

        return $this->request();
    }

    // Returns every match of the first capture group of $reg found in $str.
    function getAll($reg,$str)
    {
        preg_match_all($reg,$str,$matches);
        return $matches[1];
    }

    // $fields may be a URL-encoded string or an associative array
    // (an array is sent as multipart/form-data).
    function postForm($url, $fields, $referer='')
    {
        $this->curl = curl_init($url); // curl_init() already sets the URL
        $this->setup();
        curl_setopt($this->curl, CURLOPT_POST, true);
        curl_setopt($this->curl, CURLOPT_REFERER, $referer);
        curl_setopt($this->curl, CURLOPT_POSTFIELDS, $fields);
        return $this->request();
    }

    function getInfo($info)
    {
        $info = ($info == 'lasturl') ? curl_getinfo($this->curl, CURLINFO_EFFECTIVE_URL) : curl_getinfo($this->curl, $info);
        return $info;
    }

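    // Returns the response body as a string, or false on failure
    // (CURLOPT_RETURNTRANSFER is enabled in setup()).
    function request()
    {
        return curl_exec($this->curl);
    }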
}

?>

一个人的天荒地老
#5 · 2019-01-01 15:02

Here's an OK tutorial (link removed, see below) on web scraping using cURL and file_get_contents. Be sure to read the next few parts as well.

(direct hyperlink removed due to malware warnings)

http://www.oooff.com/php-scripts/basic-php-scraped-data-parsing/basic-php-data-parsing.php
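
For reference, fetching a page with file_get_contents can be sketched roughly as follows. This is not taken from the tutorial; the helper name fetchPage, the timeout, and the user agent string are placeholders. A data: URI is used so the example runs offline.

```php
<?php
// Hypothetical helper: fetch a URL with file_get_contents() and a stream
// context. Returns the body as a string, or null when the request fails.
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeoutSeconds,
            // Placeholder user agent; pick one that identifies your scraper.
            'user_agent' => 'ExampleScraper/0.1',
        ],
    ]);

    // "@" suppresses the warning raised on failure; we check for false instead.
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

// A data: URI keeps this example offline. In practice, pass an http(s) URL,
// which requires allow_url_fopen to be enabled in php.ini.
$html = fetchPage('data://text/plain;base64,' . base64_encode('<h1>Hello</h1>'));
echo $html, PHP_EOL;
```

Compared with cURL, this approach needs no extension, but it offers less control over cookies and redirects.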

不再属于我。
#6 · 2019-01-01 15:08

I'd like to recommend this class I recently came across: Simple HTML DOM Parser.

孤独寂梦人
#7 · 2019-01-01 15:08

Scraping can be pretty complex, depending on what you want to do. Have a read of this tutorial series on The Basics Of Writing A Scraper In PHP and see if you can get to grips with it.

You can use similar methods to automate form sign-ups and logins, and even to fake clicks on ads! The main limitation of cURL, though, is that it does not execute JavaScript, so scraping a site that uses AJAX for pagination, for example, can become a little tricky... but again, there are ways around that!
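
One possible way around the AJAX limitation (an assumption; the answer does not say which workaround it has in mind) is to request the URL that the page's own JavaScript calls, which often returns JSON. In the sketch below, the endpoint, the query parameters, and the response fields are all hypothetical, and a hard-coded string stands in for the HTTP response.

```php
<?php
// Hypothetical endpoint and parameter names; find the real ones in the
// browser's network inspector for the site you are scraping.
function buildPageUrl(string $endpoint, int $page, int $perPage = 20): string
{
    return $endpoint . '?' . http_build_query(['page' => $page, 'per_page' => $perPage]);
}

// Decode one page of JSON and return the "title" of each item.
// The response shape ({"items": [{"title": ...}], "next_page": ...}) is assumed.
function parsePage(string $json): array
{
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data['items'])) {
        return ['titles' => [], 'next' => null];
    }
    return [
        'titles' => array_column($data['items'], 'title'),
        'next'   => $data['next_page'] ?? null,
    ];
}

$url = buildPageUrl('https://example.com/api/articles', 2);

// Stand-in for the body a cURL request to $url would return.
$json = '{"items":[{"title":"First"},{"title":"Second"}],"next_page":3}';
$page = parsePage($json);

echo $url, PHP_EOL;
print_r($page);
```

Looping until 'next' is null would walk through every page without running any JavaScript.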
