Answer 1:

Object-oriented answer

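We implement as much as possible of the previous answer in one class called Browser, which should supply the normal navigation features.

Then we should be able to put the site-specific code, in a very simple form, in a new derived class that we call, say, FooBrowser, which performs the scraping of the site Foo.

The class deriving from Browser must supply some site-specific functions, such as a path() function that tells where to store site-specific information, for example: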

function path($basename) {
    return '/var/tmp/www.foo.bar/' . $basename;
}

abstract class Browser
{
    private $options = [];
    private $state   = [];
    protected $cookies;

    abstract protected function path($basename);

    public function __construct($site, $options = []) {
        $this->cookies   = $this->path('cookies');
        $this->options  = array_merge(
            [
                'site'      => $site,
                'userAgent' => 'Mozilla/5.0 (Windows NT 5.1; rv:16.0) Gecko/20100101 Firefox/16.0 - LeoScraper',
                'waitTime'  => 250000,
            ],
            $options
        );
        $this->state = [
            'referer' => '/',
            'url'     => '',
            'curl'    => '',
        ];
        $this->__wakeup();
    }

    /**
     * Reactivates after sleep (e.g. in session) or creation
     */
    public function __wakeup() {
        $this->state['curl'] = curl_init();
        $this->config([
            CURLOPT_USERAGENT       => $this->options['userAgent'],
            CURLOPT_ENCODING        => '',
            CURLOPT_NOBODY          => false,           // Retrieve the body...
            CURLOPT_BINARYTRANSFER  => true,            // ...as binary...
            CURLOPT_RETURNTRANSFER  => true,            // ...as the return value of curl_exec()...
            CURLOPT_FOLLOWLOCATION  => true,            // ...following redirections...
            CURLOPT_MAXREDIRS       => 5,               // ...reasonably...
            CURLOPT_COOKIEFILE      => $this->cookies,  // Read cookies from this file
            CURLOPT_COOKIEJAR       => $this->cookies,  // Save cookies to the same file
            CURLOPT_CONNECTTIMEOUT  => 30,              // Seconds
            CURLOPT_TIMEOUT         => 300,             // Seconds
            CURLOPT_LOW_SPEED_LIMIT => 16384,           // Bytes per second (16 KiB/s)
            CURLOPT_LOW_SPEED_TIME  => 15,              // Seconds below that speed before aborting
        ]);
    }

    /**
     * Imports an options array.
     *
     * @param array $opts
     * @throws \Exception
     */
    private function config(array $opts = []) {
        foreach ($opts as $key => $value) {
            if (true !== curl_setopt($this->state['curl'], $key, $value)) {
                throw new \Exception('Could not set cURL option');
            }
        }
    }

    private function perform($url) {
        $this->state['referer'] = $this->state['url'];
        $this->state['url'] = $url;
        $this->config([
            CURLOPT_URL     => $this->options['site'] . $this->state['url'],
            CURLOPT_REFERER => $this->options['site'] . $this->state['referer'],
        ]);
        $response = curl_exec($this->state['curl']);
        // Should we ever want to randomize waitTime, do so here.
        usleep($this->options['waitTime']);

        return $response;
    }

    /**
     * Returns a configuration option.
     * @param string $key       configuration key name
     * @param string $value     value to set
     * @return mixed
     */
    protected function option($key, $value = '__DEFAULT__') {
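        // Avoid an "undefined index" notice for options that were never set.
        $curr   = isset($this->options[$key]) ? $this->options[$key] : null;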
        if ('__DEFAULT__' !== $value) {
            $this->options[$key]    = $value;
        }
        return $curr;
    }

    /**
     * Performs a POST.
     *
     * @param $url
     * @param $fields
     * @return mixed
     */
    public function post($url, array $fields) {
        $this->config([
            CURLOPT_POST       => true,
            CURLOPT_POSTFIELDS => http_build_query($fields),
        ]);
        return $this->perform($url);
    }

    /**
     * Performs a GET.
     *
     * @param       $url
     * @param array $fields
     * @return mixed
     */
    public function get($url, array $fields = []) {
        $this->config([ CURLOPT_POST => false ]);
        if (empty($fields)) {
            $query = '';
        } else {
            $query = '?' . http_build_query($fields);
        }
        return $this->perform($url . $query);
    }
}

Now to scrape FooSite:

/* WWW_FOO_COM requires username and password to construct */

class WWW_FOO_COM_Browser extends Browser
{
    private $loggedIn   = false;

    public function __construct($username, $password) {
        parent::__construct('http://www.foo.bar.baz', [
            'username'  => $username,
            'password'  => $password,
            'waitTime'  => 250000,
            'userAgent' => 'FooScraper',
            'cache'     => true
        ]);
        // Open the session
        $this->get('/');
        // Navigate to the login page
        $this->get('/login.do');
    }

    /**
     * Perform login.
     */
    public function login() {
        $response = $this->post(
            '/ajax/loginPerform',
            [
                'j_un'    => $this->option('username'),
                'j_pw'    => $this->option('password'),
            ]
        );
        // TODO: verify that response is OK.
        // if (!strstr($response, "Welcome " . $this->option('username')))
        //     throw new \Exception("Bad username or password");
        $this->loggedIn = true;
        return true;
    }

    public function scrape($entry) {
        // We could implement caching to avoid scraping the same entry
        // too often. Save $data into path("entry-" . md5($entry))
        // and verify the filemtime of said file, is it newer than time()
        // minus, say, 86400 seconds? If yes, return file_get_content and
        // leave remote site alone.
        $data = $this->get(
            '/foobars/baz.do',
            [
                'ticker' => $entry
            ]
        );
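        return $data;
    }

    // Required by the abstract Browser class (it is called from its constructor).
    // The directory must exist and be writable.
    protected function path($basename) {
        return '/var/tmp/www.foo.bar/' . $basename;
    }
}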

Now the actual scraping code would be:

    $scraper = new WWW_FOO_COM_Browser('lserni', 'mypassword');
    if (!$scraper->login()) {
        throw new \Exception("bad user or pass");
    }
    foreach ($entries as $entry) {
        $html = $scraper->scrape($entry);
        // Parse HTML
    }
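
The caching idea described in the comments of scrape() could be sketched roughly as follows. This is only an illustration: cachedFetch() is a hypothetical helper name, not part of the classes above, and the 86400-second lifetime is simply the figure suggested in those comments.

```php
<?php
/**
 * Returns the cached content if the cache file is fresh enough; otherwise
 * calls $fetch, stores its result in the file and returns it.
 * Hypothetical helper: name and signature are assumptions for illustration.
 */
function cachedFetch(string $file, int $maxAge, callable $fetch): string
{
    clearstatcache(true, $file);
    if (is_file($file) && filemtime($file) > time() - $maxAge) {
        // Fresh copy on disk: leave the remote site alone.
        return file_get_contents($file);
    }
    $data = $fetch();
    if (!is_string($data)) {
        // curl_exec() returns false on failure; do not cache that.
        throw new \RuntimeException('Fetch failed, nothing cached');
    }
    file_put_contents($file, $data, LOCK_EX);
    return $data;
}
```

Inside scrape(), the call might then look like `cachedFetch($this->path('entry-' . md5($entry)), 86400, function () use ($entry) { return $this->get('/foobars/baz.do', ['ticker' => $entry]); })`.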

Mandatory notice: use a suitable parser to get data from raw HTML.

Answer 2:

You can do that in cURL without needing external "emulators".

The code below retrieves a page into a PHP variable to be parsed.

Scenario

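There is a page (let's call it HOME) that opens the session. Server side, if it is written in PHP, it is the one (actually, any one) that calls session_start() for the first time. In other languages you need a specific page that does all the session setup. From the client's side, it is the page that supplies the session ID cookie. In PHP, all pages that use sessions do this; in other languages the landing page does it, and all the others check whether the cookie is there and, if it is not, send you back to HOME instead of creating a session.

There is a page (LOGIN) that generates the login form and adds one critical piece of information to the session: "this user is logged in". In the code below, this is the page asking for the session ID.

And finally there are N pages where the goodies to be scraped reside.

So we want to hit HOME, then LOGIN, then the GOODIES one after another. In PHP (and in other languages, actually), HOME and LOGIN might well be the same page. Or all pages might share the same address, for example in Single Page Applications.

The Code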

    $url            = "the url generating the session ID";
    $next_url       = "the url asking for session";

    $ch             = curl_init();
    curl_setopt($ch, CURLOPT_URL,    $url);
    // We do not authenticate, only access page to get a session going.
    // Change to False if it is not enough (you'll see that cookiefile
    // remains empty).
    curl_setopt($ch, CURLOPT_NOBODY, True);

    // You may want to change User-Agent here, too
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookiefile");
    curl_setopt($ch, CURLOPT_COOKIEJAR,  "cookiefile");

    // Just in case
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

    $ret    = curl_exec($ch);

    // This page we retrieve, and scrape, with GET method
    foreach(array(
            CURLOPT_POST            => False,       // We GET...
            CURLOPT_NOBODY          => False,       // ...the body...
            CURLOPT_URL             => $next_url,   // ...of $next_url...
            CURLOPT_BINARYTRANSFER  => True,        // ...as binary...
            CURLOPT_RETURNTRANSFER  => True,        // ...into $ret...
            CURLOPT_FOLLOWLOCATION  => True,        // ...following redirections...
            CURLOPT_MAXREDIRS       => 5,           // ...reasonably...
            CURLOPT_REFERER         => $url,        // ...as if we came from $url...
            //CURLOPT_COOKIEFILE      => 'cookiefile', // Save these cookies
            //CURLOPT_COOKIEJAR       => 'cookiefile', // (already set above)
            CURLOPT_CONNECTTIMEOUT  => 30,          // Seconds
            CURLOPT_TIMEOUT         => 300,         // Seconds
            CURLOPT_LOW_SPEED_LIMIT => 16384,       // 16 Kb/s
            CURLOPT_LOW_SPEED_TIME  => 15,          // 
            ) as $option => $value)
            if (!curl_setopt($ch, $option, $value))
                    die("could not set $option to " . serialize($value));

    $ret = curl_exec($ch);
    // Done; cleanup.
    curl_close($ch);
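
The comment in the code above suggests checking whether the cookie file stays empty. Note that libcurl writes the cookie jar when the handle is closed, so the check only makes sense after curl_close(). Below is a minimal sketch of such a check; parseCookieJar() is a hypothetical helper, and it assumes the jar is in the tab-separated Netscape cookie file format that libcurl writes.

```php
<?php
/**
 * Parses the text of a Netscape-format cookie file into a name => value array.
 * Hypothetical helper, not part of the original code.
 */
function parseCookieJar(string $contents): array
{
    $cookies = [];
    foreach (preg_split('/\R/', $contents) as $line) {
        // libcurl marks HttpOnly cookies with this prefix; it is not a comment.
        if (strpos($line, '#HttpOnly_') === 0) {
            $line = substr($line, strlen('#HttpOnly_'));
        }
        if ($line === '' || $line[0] === '#') {
            continue; // blank line or comment
        }
        // Fields: domain, subdomain flag, path, secure, expiry, name, value
        $parts = explode("\t", $line);
        if (count($parts) === 7) {
            $cookies[$parts[5]] = $parts[6];
        }
    }
    return $cookies;
}
```

For a PHP-backed site with default settings, one would expect `parseCookieJar(file_get_contents('cookiefile'))` to contain a `PHPSESSID` entry; other sites use other cookie names.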

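Implementation

First we have to get the login page.

We use a special User-Agent to introduce ourselves, both to be recognizable (we don't want to antagonize the webmaster) and to fool the server into sending us a specific, browser-tailored version of the site. Ideally, we use the same User-Agent as whatever browser we are going to use to debug the page, plus a suffix that makes it clear to whoever checks that they are looking at an automated tool (see the comment by Halfer).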

    $ua = 'Mozilla/5.0 (Windows NT 5.1; rv:16.0) Gecko/20100101 Firefox/16.0 (ROBOT)';
    $cookiefile = "cookiefile";
    $url1 = "the login url generating the session ID";

    $ch             = curl_init();

    curl_setopt($ch, CURLOPT_URL,            $url1);
    curl_setopt($ch, CURLOPT_USERAGENT,      $ua);
    curl_setopt($ch, CURLOPT_COOKIEFILE,     $cookiefile);
    curl_setopt($ch, CURLOPT_COOKIEJAR,      $cookiefile);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, True);
    curl_setopt($ch, CURLOPT_NOBODY,         False);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, True);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, True);
    $ret    = curl_exec($ch);

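This will retrieve the page asking for user and password. By inspecting the page, we find the needed fields (including hidden ones) and can populate them. The FORM tag tells us whether we need to go on with POST or GET.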
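
The inspection can also be done programmatically. The following is a rough sketch that lists the action, method and input fields of the first form in a page; describeForm() is a hypothetical helper, and it only looks at input elements (not select or textarea).

```php
<?php
/**
 * Extracts action, method and named <input> fields of the first <form>.
 * Hypothetical helper for illustration; returns null if no form is found.
 */
function describeForm(string $html): ?array
{
    if ('' === trim($html)) {
        return null;
    }
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $form = $dom->getElementsByTagName('form')->item(0);
    if (null === $form) {
        return null;
    }
    $fields = [];
    foreach ($form->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ('' !== $name) {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return [
        'action' => $form->getAttribute('action'),
        // HTML forms default to GET when no method is given.
        'method' => strtoupper($form->getAttribute('method') ?: 'GET'),
        'fields' => $fields,
    ];
}
```

Calling `describeForm($ret)` on the login page would then show which fields, hidden ones included, have to be sent back.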

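We might want to inspect the form code to adjust the following operations, so we ask cURL to return the page content as-is into $ret, and to actually return the page body. Sometimes, setting CURLOPT_NOBODY to True is still enough to trigger session creation and cookie submission, and if so, it is faster. But CURLOPT_NOBODY ("no body") works by issuing a HEAD request instead of a GET, and sometimes the HEAD request does not work because the server will only react to a full GET.

Instead of retrieving the body this way, it is also possible to log in using a real Firefox and sniff the form content being posted with Firebug (or Chrome with its developer tools); some sites will try to populate or modify hidden fields with JavaScript, so that the form being submitted is not the one you see in the HTML code.

A webmaster who wanted his site not scraped might send a hidden field with the timestamp. A human being (not aided by a too-clever browser; there are ways to tell browsers not to be clever, at worst by changing the names of the user and password fields every time) takes at least three seconds to fill in a form. A cURL script takes zero. Of course, a delay can be simulated. It's all shadowboxing...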
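
Simulating such a delay could look roughly like this. humanDelay() is a hypothetical helper; the three-second lower bound comes from the paragraph above, while the eight-second upper bound is an arbitrary assumption.

```php
<?php
/**
 * Sleeps for a random time between $minMs and $maxMs milliseconds and
 * returns the delay that was actually used.
 * Hypothetical helper; the default bounds are assumptions.
 */
function humanDelay(int $minMs = 3000, int $maxMs = 8000): int
{
    $delayMs = random_int($minMs, $maxMs);
    usleep($delayMs * 1000); // usleep() takes microseconds
    return $delayMs;
}
```

It would be called once between fetching the login form and posting it.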

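We may also want to be on the lookout for the form's appearance. A webmaster could, for example, build a form asking for name, email and password, and then, through CSS, move the "email" field to where you would expect to find the name, and vice versa. So the real form being submitted will have an "@" in a field called username, and none in the field called email. The server, which expects this, merely swaps the two fields back. A "scraper" built by hand (or a spambot) would do what seems natural and send an email address in the email field. And by doing so, it betrays itself. By working through the form once with a real CSS- and JS-aware browser, sending meaningful data, and sniffing what actually gets sent, we might be able to overcome this particular obstacle. Might, because there are ways of making life difficult. As I said, shadowboxing.

Back to the case at hand: here the form contains three fields and has no JavaScript overlay. We have cPASS, cUSR, and checkLOGIN with a value of "Check Login".

So we prepare the form with the proper fields. Note that the form is to be sent as application/x-www-form-urlencoded, which in PHP cURL means two things:

we are to use CURLOPT_POST
the option CURLOPT_POSTFIELDS must be a string (an array would tell cURL to submit as multipart/form-data, which might work... or might not).

The form fields are, as the name says, urlencoded; there is a function for that.

We read the action attribute of the form; that is the URL we must use to submit our authentication.

So, everything being ready...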

    $fields = array(
        'checkLOGIN' => 'Check Login',
        'cUSR'       => 'jb007',
        'cPASS'      => 'astonmartin',
    );
    $coded = array();
    foreach($fields as $field => $value)
        $coded[] = $field . '=' . urlencode($value);
    $string = implode('&', $coded);

    curl_setopt($ch, CURLOPT_URL,         $url1); //same URL as before, the login url generating the session ID
    curl_setopt($ch, CURLOPT_POST,        True);
    curl_setopt($ch, CURLOPT_POSTFIELDS,  $string);
    $ret    = curl_exec($ch);
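
As a side note, the manual encoding loop can be replaced by PHP's http_build_query(), which the Browser class in the first answer already uses. A small sketch showing that both give the same string for these fields:

```php
<?php
$fields = [
    'checkLOGIN' => 'Check Login',
    'cUSR'       => 'jb007',
    'cPASS'      => 'astonmartin',
];

// Manual encoding, as in the loop above.
$coded = [];
foreach ($fields as $field => $value) {
    $coded[] = $field . '=' . urlencode($value);
}
$manual = implode('&', $coded);

// Built-in equivalent. The separator is passed explicitly so that the
// arg_separator.output ini setting cannot change the result.
$built = http_build_query($fields, '', '&');
```

Unlike the loop, http_build_query() also encodes the field names, which matters only when they contain special characters.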

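We now expect a "Hello, James - how about a nice game of chess?" page. But more than that, we expect that the session associated with the cookie saved in $cookiefile has been supplied with the critical information: "user is authenticated".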
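
Whether the login actually worked can be checked by looking for some text that only appears on the greeting page. A minimal sketch, with loginSucceeded() being a hypothetical helper and the marker text depending entirely on the target site:

```php
<?php
/**
 * Returns true when the login response contains the expected marker text.
 * Hypothetical helper; $marker is site-specific.
 */
function loginSucceeded($response, string $marker): bool
{
    // curl_exec() returns false on transport errors.
    return is_string($response) && false !== stripos($response, $marker);
}
```

Here it could be called as `loginSucceeded($ret, 'Hello, James')` before starting to request further pages.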

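So all following page requests made using $ch and the same cookie jar will be granted access, allowing us to "scrape" pages quite easily. Just remember to set the request mode back to GET: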

    curl_setopt($ch, CURLOPT_POST,        False);

    // Start spidering
    foreach($urls as $url)
    {
        curl_setopt($ch, CURLOPT_URL, $url);
        $HTML = curl_exec($ch);
        if (False === $HTML)
        {
            // Something went wrong, check curl_error() and curl_errno().
        }
    }
    curl_close($ch);

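In the loop, you have access to $HTML, the HTML code of every single page.

Great the temptation of using regexps is. Resist it you must. To better cope with ever-changing HTML, and to be sure not to turn up false positives or false negatives when the layout stays the same but the content changes (e.g. you discover that you have the weather forecasts for Nice, Tourrette-Levens and Castagniers, but never for Aspremont or Gattières, and isn't that curious?), the best option is to use DOM: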

Grabbing the href attribute of an A element
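
A minimal sketch of that DOM approach, using PHP's bundled DOM extension; extractLinks() is a hypothetical helper name, and this is only one of several ways to do it:

```php
<?php
/**
 * Returns the href attribute of every <a> element found in $html.
 * Hypothetical helper illustrating DOM-based extraction.
 */
function extractLinks(string $html): array
{
    if ('' === trim($html)) {
        return [];
    }
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        // Skip anchors without an href (e.g. named anchors).
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}
```

Inside the spidering loop, `extractLinks($HTML)` would give the links of each page without a single regular expression.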

Source: How can I scrape website content in PHP from a website that requires a cookie login?