Get products from an e-shop using Simple HTML DOM parser

Posted 2019-03-04 02:55

Question:

I want to parse the link, name and price of some products. Here's my code. I'm having trouble because I don't know how to get the product links and names. The price is fine, I can get it. Pagination isn't working either.

 <h2>Telefonai Pigu</h2>
</br>
<?php
  include_once('simple_html_dom.php'); 
  $url = "http://pigu.lt/foto_gsm_mp3/mobilieji_telefonai/";
  // Start from the main page
  $nextLink = $url;

// Loop over each next link as long as it exists
while ($nextLink) {
echo "<hr>nextLink: $nextLink<br>";
//Create a DOM object
$html = new simple_html_dom();
// Load HTML from a url
$html->load_file($nextLink);


$phones = $html->find('div#productList span.product');

foreach($phones as $phone) {
    // Get the link
    $linkas = $phone->href;

    // Get the name
    $pavadinimas = $phone->find('a[alt]', 0)->plaintext;

    // Get the price text
    $kaina = $phone->find('strong[class=nw]', 0)->plaintext;
    // This captures the integer part of decimal numbers: In "123,45" will capture      "123"... Use @([\d,]+),?@ to capture the decimal part too

    echo $pavadinimas, " #----# ", $kaina, " #----# ", $linkas, "<br>";

  //$query = "insert into telefonai (pavadinimas,kaina,linkas) VALUES (?,?,?)";
//  $this->db->query($query, array($pavadinimas,$kaina, $linkas));
}


// Extract the next link, if not found return NULL
$nextLink = ( ($temp = $html->find('div.pagination a[="rel"]', 0)) ? "https://www.pigu.lt".$temp->href : NULL );

// Clear DOM object
$html->clear();
unset($html);
}
?>

Output:

nextLink: http://pigu.lt/foto_gsm_mp3/mobilieji_telefonai/
A PHP Error was encountered
Severity: Notice
Message: Trying to get property of non-object
Filename: views/pigu_view.php
Line Number: 26
#----# 999,00 Lt #----#
A PHP Error was encountered
Severity: Notice
Message: Trying to get property of non-object
Filename: views/pigu_view.php
Line Number: 26

Answer 1:

Please inspect the source of the page you're working on carefully; based on that, you can retrieve the nodes you want. It's normal that code written for another website won't work here, since the two websites don't have the same markup structure.

Let's proceed, again, step by step.

$phones = $html->find('div#productList span.product'); gives you all the phone containers, or what I call "blocks". Each block has the following structure:

<span class="product">
   <div class="fakeProductContainer">
      <p class="productPhoto">
         <span class="">
         <span class="flags flag-disc-value" title="Akcija"><strong>500<br><span class="currencySymbol">Lt</span></strong></span>
         <span class="flags freeShipping" title="Nemokamas prekių atsiemimas į POST24 paštomatus. Pasiūlymas galioja iki sausio 31 d."></span>
         </span>
         <a href="/foto_gsm_mp3/mobilieji_telefonai/telefonas_sony_xperia_acro_s?id=4522595" title="Telefonas Sony Xperia acro S" class="photo-medium nobr"><img src="http://lt1.pigugroup.eu//colours/48355/16/4835516/c503caf69ad97d889842a5fd5b3ff372_medium.jpg" title="Telefonas Sony Xperia acro S" alt="Telefonas Sony Xperia acro S"></a>
      </p>
      <div class="price">
         <strong class="nw">999,00 Lt</strong>
         <del class="nw">1.499,00 Lt *</del>
      </div>
      <h3><a href="/foto_gsm_mp3/mobilieji_telefonai/telefonas_sony_xperia_acro_s?id=4522595" title="Telefonas Sony Xperia acro S">Sony Xperia acro S</a></h3>
      <p class="descFields">
         3G: <em>HSDPA 14.4 Mbps, HSUPA 5.76 Mbps</em><br>
         GPS: <em>Taip</em><br>
         NFC: <em>Taip</em><br>
         Operacinė sistema: <em>Android OS</em><br>
      </p>
   </div>
</span>

The anchor containing the product link is inside <p class="productPhoto">, and it is the only anchor in there. To retrieve it, simply use $linkas = $phone->find('p.productPhoto a', 0)->href; and then prepend the site's base URL, since the href is only a relative link.
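To illustrate the "complete the relative link" step, here is a minimal sketch. The helper name absoluteUrl is hypothetical (it is not part of simple_html_dom), and it only handles the root-relative hrefs seen in the markup above, not every form of relative URL.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): turn a root-relative
// href such as "/foto_gsm_mp3/..." into an absolute URL on the same site.
function absoluteUrl(string $base, string $href): string
{
    // Already absolute: return it unchanged.
    if (preg_match('@^https?://@i', $href)) {
        return $href;
    }
    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host'];
    // Simplification: every other href is treated as relative to the site root.
    return $origin . '/' . ltrim($href, '/');
}

echo absoluteUrl(
    'http://pigu.lt/foto_gsm_mp3/mobilieji_telefonai/',
    '/foto_gsm_mp3/mobilieji_telefonai/telefonas_sony_xperia_acro_s?id=4522595'
), "\n";
```

Deriving the prefix from the page URL keeps the scheme and host in one place instead of repeating a hard-coded string next to every href.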

The product name is inside the <h3> tag, so again we simply use $pavadinimas = $phone->find('h3 a', 0)->plaintext; to retrieve it.

The price is inside <div class="price"><strong>, and again we use $kaina = $phone->find('div[class=price] strong', 0)->plaintext; to retrieve it.

However, not all phones have their price displayed, so we must check whether the price was actually found before reading it.
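The check can be combined with the number extraction in one small function. The sketch below is an assumption-laden illustration: parsePrice is a hypothetical name, and the format it expects ("." as thousands separator, "," as decimal separator) is inferred only from the sample strings "999,00 Lt" and "1.499,00 Lt *" in the block above.

```php
<?php
// Hypothetical helper: convert a price string such as "1.499,00 Lt" into a
// float, or return null when the text is missing or contains no number.
function parsePrice(?string $text): ?float
{
    if ($text === null || !preg_match('@\d[\d.]*(?:,\d+)?@', $text, $m)) {
        return null;
    }
    // Drop "." thousands separators, then turn the decimal comma into a dot.
    $normalized = str_replace(',', '.', str_replace('.', '', $m[0]));
    return (float) $normalized;
}

var_dump(parsePrice('999,00 Lt'));     // float(999)
var_dump(parsePrice('1.499,00 Lt *')); // float(1499)
var_dump(parsePrice(null));            // NULL
```

With simple_html_dom it could be called as parsePrice($node ? $node->plaintext : null), where $node is the result of find('div[class=price] strong', 0), so a missing price yields null instead of a "non-object" notice.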

And finally, the HTML code containing the next link is the following:

<div id="ListFootPannel">
   <div class="pages-list">
      <strong>1</strong>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=2">2</a>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=3">3</a>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=4">4</a>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=5">5</a>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=6">6</a>
      <a rel="next" href="/foto_gsm_mp3/mobilieji_telefonai?page=2">Toliau</a>      
   </div>
   <div class="pages-info">
      Prekių 
   </div>
</div>

So we are interested in the <a rel="next"> tag, which can be retrieved using $html->find('div#ListFootPannel a[rel="next"]', 0).
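For comparison, the same lookup can be tried without any third-party library. The sketch below swaps in PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom, and runs against a shortened copy of the pagination markup shown above. The function name findNextHref is hypothetical.

```php
<?php
// Sketch using PHP's built-in DOM extension, not simple_html_dom.
$markup = <<<HTML
<div id="ListFootPannel">
   <div class="pages-list">
      <strong>1</strong>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=2">2</a>
      <a rel="next" href="/foto_gsm_mp3/mobilieji_telefonai?page=2">Toliau</a>
   </div>
</div>
HTML;

// Hypothetical helper: return the href of the rel="next" anchor, or null.
function findNextHref(string $html): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//div[@id="ListFootPannel"]//a[@rel="next"]');
    if ($nodes === false || $nodes->length === 0) {
        return null; // no "next" anchor, e.g. on the last page
    }
    return $nodes->item(0)->getAttribute('href');
}

var_dump(findNextHref($markup));
var_dump(findNextHref('<div id="ListFootPannel"></div>'));
```

Returning null when the anchor is absent is what lets a while ($nextLink) loop stop on the last page.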

So, if we apply these modifications to your original code, we get:

include_once('simple_html_dom.php');

$url = "http://pigu.lt/foto_gsm_mp3/mobilieji_telefonai/";

// Start from the main page
$nextLink = $url;

// Loop over each next link as long as it exists
while ($nextLink) {
    echo "nextLink: $nextLink<br>";
    //Create a DOM object
    $html = new simple_html_dom();
    // Load HTML from a url
    $html->load_file($nextLink);

    ////////////////////////////////////////////////
    /// Get phone blocks and extract useful info ///
    ////////////////////////////////////////////////
    $phones = $html->find('div#productList span.product');

    foreach($phones as $phone) {
        // Get the link
        $linkas = "http://pigu.lt" . $phone->find('p.productPhoto a', 0)->href;

        // Get the name
        $pavadinimas = $phone->find('h3 a', 0)->plaintext;

        // If the price is not found, find() returns null, so fall back to "000"
        if ( $tempPrice = $phone->find('div[class=price] strong', 0) ) {
            // Get the price text and extract the useful part using a regex
            $kaina = $tempPrice->plaintext;
            // This captures the integer part of decimal numbers: In "123,45" will capture "123"... Use @([\d,]+),?@ to capture the decimal part too
            preg_match('@(\d+),?@', $kaina, $matches);
            $kaina = $matches[1];
        }
        else
            $kaina = "000";


        echo $pavadinimas, " #----# ", $kaina, " #----# ", $linkas, "<br>";

    }
    ////////////////////////////////////////////////
    ////////////////////////////////////////////////

    // Extract the next link, if not found return NULL
    $nextLink = ( ($temp = $html->find('div#ListFootPannel a[rel="next"]', 0)) ? "http://pigu.lt".$temp->href : NULL );

    // Clear DOM object
    $html->clear();
    unset($html);

    echo "<hr>";
}
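The loop above stops only when a page has no rel="next" anchor. As a defensive variation, here is a hedged sketch that also stops on a repeated URL or after a page limit. The names crawlPages, $getNext and $maxPages are hypothetical, and the "site" is a plain array standing in for real page fetches.

```php
<?php
// Hypothetical helper: follow "next" links with two safety limits.
// $getNext is any callable mapping a page URL to the next URL, or null.
function crawlPages(string $start, callable $getNext, int $maxPages = 50): array
{
    $visited = [];
    $url = $start;
    // Stop on: no next link, an already visited URL, or the page limit.
    while ($url !== null && !isset($visited[$url]) && count($visited) < $maxPages) {
        $visited[$url] = true;
        $url = $getNext($url);
    }
    return array_keys($visited);
}

// Stand-in for real scraping: page "p3" wrongly links back to "p2".
$fakeSite = ['p1' => 'p2', 'p2' => 'p3', 'p3' => 'p2'];
$pages = crawlPages('p1', fn($u) => $fakeSite[$u] ?? null);
print_r($pages);
```

In a real run, $getNext would load the page, process its products, and return the absolute URL of the rel="next" anchor or null.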
