How to implement a web scraper in PHP? [closed]

2019-01-01 14:45发布

What built-in PHP functions are useful for web scraping? What are some good resources (web or print) for getting up to speed on web scraping with PHP?

15条回答
伤终究还是伤i
2楼-- · 2019-01-01 15:08

Scraper class from my framework:

<?php

/*
    Example:

    $site = $this->load->cls('scraper', 'http://www.anysite.com');
    $excss = $site->getExternalCSS();
    $incss = $site->getInternalCSS();
    $ids = $site->getIds();
    $classes = $site->getClasses();
    $spans = $site->getSpans(); 

    print '<pre>';
    print_r($excss);
    print_r($incss);
    print_r($ids);
    print_r($classes);
    print_r($spans);        

*/

class scraper
{
    private $url = '';

    public function __construct($url)
    {
        $this->url = file_get_contents("$url");
    }

    public function getInternalCSS()
    {
        $tmp = preg_match_all('/(style=")(.*?)(")/is', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

    public function getExternalCSS()
    {
        $tmp = preg_match_all('/(href=")(\w.*\.css)"/i', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

    public function getIds()
    {
        $tmp = preg_match_all('/(id="(\w*)")/is', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

    public function getClasses()
    {
        $tmp = preg_match_all('/(class="(\w*)")/is', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

    public function getSpans(){
        $tmp = preg_match_all('/(<span>)(.*)(<\/span>)/', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

}
?>
查看更多
初与友歌
3楼-- · 2019-01-01 15:09

file_get_contents() can take a remote URL and give you the source. You can then use regular expressions (with the Perl-compatible functions) to grab what you need.

Out of curiosity, what are you trying to scrape?

查看更多
梦寄多情
4楼-- · 2019-01-01 15:09

The curl library allows you to download web pages. You should look into regular expressions for doing the scraping.

查看更多
看风景的人
5楼-- · 2019-01-01 15:11

Nice PHP web scraping ebook here:

https://leanpub.com/web-scraping

查看更多
无与为乐者.
6楼-- · 2019-01-01 15:12

I'd either use libcurl or Perl's LWP (libwww for perl). Is there a libwww for php?

查看更多
怪性笑人.
7楼-- · 2019-01-01 15:14

There is a Book "Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL" on this topic - see a review here

PHP-Architect covered it in a well written article in the December 2007 Issue by Matthew Turland

查看更多
登录 后发表回答