PHP web scraping of JavaScript-generated content

Posted 2020-01-27 08:08

I am stuck on a scraping task in my project.

I want to grab the data from the page loaded into $html: all the table content in the tr and td elements. Here I am trying to grab the links, but the output only shows javascript: self.close()

<?php
include("simple_html_dom.php");

$html = file_get_html('http://www.areacodelocations.info/allcities.php?ac=201');

foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}
?>

2 answers
混吃等死
#2 · 2020-01-27 08:38

In the case of the second site, you have a table with several TR elements, and you want to catch the first two TD children of each TR.

By inspecting the source code you see something like this:

<tr>
      <td>&nbsp;Allendale</td>
      <td>&nbsp;Eastern Time
</td>
    </tr>
    <tr>
      <td>&nbsp;Alpine</td>
      <td>&nbsp;Eastern Time
</td>

So you just grab all the TRs:

<?php
    include("simple_html_dom.php");

    $html = file_get_html('http://www.areacodelocations.info/allcities.php?ac=201');

    $fp = fopen('output.csv', 'w');

    if (!$fp) die("Cannot open output CSV - permission problems maybe?");

    foreach($html->find('tr') as $tr) {
       $csv = array(); // Start empty. A new CSV row for each TR.
       // Now find the TD children of $tr. They will make up a row.
       foreach($tr->find('td') as $td) {
           // Get the TD's innertext; it still contains &nbsp; entities and stray whitespace
           $csv[] = $td->innertext;
       }
       fputcsv($fp, $csv);
    }

    fclose($fp);
  ?>

You will notice that the CSV text is "dirty". That is because the actual text is:

      <td>&nbsp;Alpine</td>
      <td>&nbsp;Eastern Time[CARRIAGE RETURN HERE]
          </td>

So to have "Alpine" and "Eastern Time", you have to replace

           $csv[] = $td->innertext;

with something like

           $csv[] = trim(
                str_replace(
                  "\xC2\xA0", // UTF-8 non-breaking space that &nbsp; decodes to
                  ' ',
                  html_entity_decode(
                    $td->innertext,
                    ENT_COMPAT | ENT_HTML401,
                    'UTF-8'
                  )
                )
           );

Check out the PHP man page for html_entity_decode() about character set encoding and entity handling. The above ought to work -- and an ought and fifty cents will get you a cup of coffee :-)
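
The cleanup step can be tried in isolation, without fetching the page. This is a minimal sketch, not a definitive implementation: clean_cell() is a hypothetical helper name, and the two input strings are modelled on the TD contents shown above. One detail worth knowing is that &nbsp; decodes to U+00A0, which trim() does not remove by default, so the sketch converts it to a plain space first.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): normalise one table cell.
function clean_cell(string $raw): string
{
    // Turn entities such as &nbsp; into real characters, encoded as UTF-8.
    $text = html_entity_decode($raw, ENT_COMPAT | ENT_HTML401, 'UTF-8');
    // &nbsp; becomes U+00A0 (bytes C2 A0); swap it for an ordinary space.
    $text = str_replace("\xC2\xA0", ' ', $text);
    // Remove the leading space and the trailing line break and indentation.
    return trim($text);
}

// Sample cells modelled on the markup quoted in this answer.
$row = array_map('clean_cell', ["&nbsp;Alpine", "&nbsp;Eastern Time\r\n          "]);
echo implode(',', $row), "\n";
```

In the scraping loop, the same helper would be applied to $td->innertext before the value is appended to $csv.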

家丑人穷心不美
#3 · 2020-01-27 08:53

Usually, pages of this kind load a bunch of JavaScript (jQuery, etc.), which then builds the interface and retrieves the data to be displayed from a data source.

So what you need to do is open that page in Firefox or similar, with a tool such as Firebug, in order to see what requests are actually being made. If you're lucky, you will find the data source directly in the list of XHR requests, as in this case:

http://www.govliquidation.com/json/buyer_ux/salescalendar.js

Note that this course of action may infringe on some license or terms of use. Clear this with the webmaster/data source/copyright owner before proceeding: detecting and forbidding this kind of scraping is very easy, and identifying you is probably only slightly less so.

Anyway, if you issue the same call in PHP, you can directly scrape the data (provided there is no session/authentication issue, as seems the case here) with very simple code:

<?php

    $url = "http://www.govliquidation.com/json/buyer_ux/salescalendar.js";

    $json = file_get_contents($url);

    $data = json_decode($json);

?>

This yields a data object that you can inspect and convert to CSV by simple looping.

stdClass Object
(
    [result] => stdClass Object
        (
            [events] => Array
                (
                    [0] => stdClass Object
                        (
                            [yahoo_dur] => 11300
                            [closing_today] => 0
                            [language_code] => en
                            [mixed_id] => 9297
                            [event_id] => 9297
                            [close_meridian] => PM
                            [commercial_sale_flag] => 0
                            [close_time] => 01/06/2014
                            [award_time_unixtime] => 1389070800
                            [category] => Tires, Parts & Components
                            [open_time_unixtime] => 1388638800
                            [yahoo_date] => 20140102T000000Z
                            [open_time] => 01/02/2014
                            [event_close_time] => 2014-01-06 17:00:00
                            [display_event_id] => 9297
                            [type_code] => X3
                            [title] => Truck Drive Axles @ Killeen, TX
                            [special_flag] => 1
                            [demil_flag] => 0
                            [google_close] => 20140106
                            [event_open_time] => 2014-01-02 00:00:00
                            [google_open] => 20140102
                            [third_party_url] =>
                            [bid_package_flag] => 0
                            [is_open] => 1
                            [fda_count] => 0
                            [close_time_unixtime] => 1389045600

You retrieve $data->result->events, use fputcsv() on its items converted to array form, and Bob's your uncle.
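
That last step might look roughly like the sketch below. It makes a few assumptions: the inline JSON is a heavily reduced sample shaped like the dump above (the real feed has many more fields), the header row built from the first event's keys is an addition for readability, and the CSV is written to an in-memory stream so the snippet runs without touching the disk or the network.

```php
<?php
// Reduced sample shaped like the dump above; the real response is much larger.
$json = '{"result":{"events":[
  {"event_id":"9297","title":"Truck Drive Axles @ Killeen, TX",
   "open_time":"01/02/2014","close_time":"01/06/2014"}
]}}';

$data = json_decode($json);
if ($data === null || !isset($data->result->events)) {
    die("Unexpected JSON structure\n");
}

// In-memory stream for the demo; pass a file path to fopen() for real output.
$fp = fopen('php://memory', 'w+');

$headerWritten = false;
foreach ($data->result->events as $event) {
    $row = (array) $event; // stdClass object -> associative array
    if (!$headerWritten) {
        // Assumption: use the first event's keys as the column headers.
        fputcsv($fp, array_keys($row), ',', '"', '\\');
        $headerWritten = true;
    }
    fputcsv($fp, array_values($row), ',', '"', '\\');
}

// Read the generated CSV back so it can be printed.
rewind($fp);
$csv = stream_get_contents($fp);
fclose($fp);

echo $csv;
```

With the live feed, $json would instead come from file_get_contents($url), and events whose field sets differ would need their columns aligned before writing.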
