Scraping Library for PHP - phpQuery?

Posted 2019-03-09 18:52

Question:

I'm looking for a PHP library that lets me scrape webpages and takes care of all the cookies and of prefilling forms with their default values, which is the part that annoys me the most.

I'm tired of having to match every single input element with XPath, and I would love it if something better existed. I've come across phpQuery, but the manual isn't very clear and I can't figure out how to make POST requests.

Can someone help me? Thanks.

@Jonathan Fingland:

In the example provided by the manual for browserGet() we have:

require_once('phpQuery/phpQuery.php');

phpQuery::browserGet('http://google.com/', 'success1');

function success1($browser)
{
    $browser->WebBrowser('success2')
    ->find('input[name=q]')->val('search phrase')
    ->parents('form')
    ->submit();
}

function success2($browser)
{
    echo $browser;
}

I suppose all the other fields are scraped and sent back in the GET request. I want to do the same with the phpQuery::browserPost() method, but I don't know how. The form I'm trying to scrape has an input token, and I would love it if phpQuery were smart enough to scrape the token and just let me change the other fields (in this case username and password), submitting everything via POST.

PS: Rest assured, this is not going to be used for spamming.

Answer 1:

See http://code.google.com/p/phpquery/wiki/Ajax and in particular:

phpQuery::post($url, $data, $callback, $type)

and

data (Object, String) — the data parameter can be either an Object or a String. POST requests should be possible using query-string format, e.g.:

$data = "username=Jon&password=123456";
$url = "http://www.mysite.com/login.php";
phpQuery::post($url, $data, $callback, $type);

As phpQuery is a jQuery port, the method signature is the same (the docs link directly to the jQuery site: http://docs.jquery.com/Ajax/jQuery.post).
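
For instance, a minimal sketch of a complete call built only from the signature quoted above (untested; the URL, field names, callback name, and the 'html' type argument are all placeholders, so adjust them to the target form):

require_once('phpQuery/phpQuery.php');

// Hypothetical login endpoint and credentials.
$url  = 'http://www.mysite.com/login.php';
$data = 'username=Jon&password=123456';   // query-string format, as quoted above

// 'loginDone' is invoked with the server response once the POST completes.
phpQuery::post($url, $data, 'loginDone', 'html');

function loginDone($response)
{
    // Dump the response to inspect whether the login succeeded.
    echo $response;
}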

Edit

Two things:

There is also a phpQuery::browserPost function, which might meet your needs better.

However, also note that the success2 callback is only called by the submit() or click() methods, so you can fill in all of the form fields before that point.

e.g.

require_once('phpQuery/phpQuery.php');
phpQuery::browserGet('http://www.mysite.com/login.php', 'success1');
function success1($browser) {
  $handle = $browser
    ->WebBrowser('success2');
  $handle
    ->find('input[name=username]')
      ->val('Jon');
  $handle
    ->find('input[name=password]')
      ->val('123456')
      ->parents('form')
        ->submit();
}
function success2($browser) {
  print $browser;
}

(Note that this has not been tested, but should work)



Answer 2:

I've used SimpleTest's ScriptableBrowser for such stuff in the past. It's part of the SimpleTest testing framework, but you can use it stand-alone.
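
As a rough illustration of driving the scriptable browser stand-alone (untested here; the include path assumes a standard SimpleTest checkout, and the login URL, field names, and submit-button label are placeholders):

require_once('simpletest/browser.php');

// Hypothetical login page and credentials.
$browser = new SimpleBrowser();
$browser->get('http://www.mysite.com/login.php');

// The browser keeps cookies between requests and resubmits any hidden
// fields (such as a CSRF token) along with the values set below.
$browser->setField('username', 'Jon');
$browser->setField('password', '123456');
$browser->clickSubmit('Login');

echo $browser->getContent();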



Answer 3:

I would use a dedicated library for parsing HTML files and a dedicated library for processing HTTP requests. Using the same library for both seems like a bad idea, IMO.

For processing HTTP requests, check out e.g. Httpful, Unirest, Requests, or Guzzle. Guzzle is especially popular these days, but in the end, whichever library works best for you is still a matter of personal taste.
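
For example, a minimal Guzzle sketch of the kind of login POST discussed above (untested; the URL and field names are placeholders, and the shared cookie jar is what keeps the session alive across requests):

use GuzzleHttp\Client;
use GuzzleHttp\Cookie\CookieJar;

require 'vendor/autoload.php';

// Share one cookie jar so the session survives across requests.
$jar    = new CookieJar();
$client = new Client(['cookies' => $jar]);

// Hypothetical login endpoint and form fields.
$response = $client->request('POST', 'http://www.mysite.com/login.php', [
    'form_params' => [
        'username' => 'Jon',
        'password' => '123456',
    ],
]);

echo $response->getBody();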

For parsing HTML files I would recommend a library that I wrote myself: DOM-Query. It allows you to (1) load an HTML file and then (2) select or change parts of your HTML pretty much the same way you would if you were using jQuery in a frontend app.
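
To give a rough idea of that select-and-modify workflow, here is a sketch using PHP's built-in DOMDocument and DOMXPath rather than DOM-Query's own API (see the library's README for its actual jQuery-style syntax); the file name and field name are placeholders:

// Load an HTML file and tweak one of its inputs with the built-in DOM extension.
$doc = new DOMDocument();
@$doc->loadHTMLFile('login.html');   // @ silences warnings about sloppy real-world HTML

$xpath = new DOMXPath($doc);

// Give the username input a default value.
foreach ($xpath->query('//input[@name="username"]') as $input) {
    $input->setAttribute('value', 'Jon');
}

echo $doc->saveHTML();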