PHP scraping and outputting a specific value or n

Posted 2019-03-05 08:32

Question:

I'm very new to PHP, but with some help I've figured out how to scrape a site when the element has an identifier such as h1 class=____.

Even better, I've figured out how to output the precise word or value I want, as long as it is separated by whitespace. For example, if a given tag <INVENTORY> outputs "30 balls", I can echo element [0] and only 30 is output, which is great.

I'm running into an issue, though, where I'm trying to extract a value that is not separated by a blank space. Say I want "-34.89" as the output (more precisely, whatever number is in that placeholder on the site, since the numbers on the source site change over time).

But the output I'm getting is "-34.89dowjonesstockchange". There's no blank space there.

What do I do to output just the -34.89, or whatever number may be in its place on a given day? There must be some way to say, for the output above, "only output characters [0,1,2,3,4,5]", which would be -34.89 here.

Below is a test example against a website that outputs words and values split on " " (blank space). It is almost what I need, but it lacks this way of being even more precise.

// Scrapes the Ethereum row from coinmarketcap.com and caches one field.
// deleteBlankInArray() and globalVars are not shown in this post.
function getEthereumchange(){
    $doc = new DOMDocument;

    // We don't want to bother with white spaces
    $doc->preserveWhiteSpace = false;

    // Tolerate malformed HTML instead of failing on it
    $doc->strictErrorChecking = false;
    $doc->recover = true;

    $doc->loadHTMLFile('https://coinmarketcap.com/');

    $xpath = new DOMXPath($doc);

    $query = "//tr[@id='id-ethereum']";

    $entries = $xpath->query($query);
    foreach ($entries as $entry) {
        $result = trim($entry->textContent);
        $ret_ = explode(' ', $result);
        // make sure no element in the array starts or ends with whitespace
        foreach ($ret_ as $key => $val) {
            $ret_[$key] = trim($val);
        }
        // drop empty elements and elements that are only "\n", "\r" or "\t";
        // the callback name must be passed as a string
        $ret_ = array_values(array_filter($ret_, 'deleteBlankInArray'));

        // write the eighth element (index 7) to the cache file
        file_put_contents(globalVars::$_cache_dir . "ethereumchange", $ret_[7]);
    }
}

Thank you so much.

Answer 1:

If you are willing to use a third-party library, you can use https://github.com/rajanrx/php-scrape

<?php

use Scraper\Scrape\Crawler\Types\GeneralCrawler;
use Scraper\Scrape\Extractor\Types\MultipleRowExtractor;

require_once(__DIR__ . '/../vendor/autoload.php');
date_default_timezone_set('UTC');

// Create crawler
$crawler = new GeneralCrawler('https://coinmarketcap.com/');

// Setup configuration
$configuration = new \Scraper\Structure\Configuration();
$configuration->setTargetXPath('//table[@id="currencies"]');
$configuration->setRowXPath('.//tbody/tr');
$configuration->setFields(
    [
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Name',
                'xpath' => './/td[2]/a',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Market Cap',
                'xpath' => './/td[3]',
            ]
        ),
        new \Scraper\Structure\RegexField(
            [
                'name'  => '% Change',
                'xpath' => './/td[7]',
                'regex' => '/(.*)%/'
            ]
        ),
    ]
);

// Extract  data
$extractor = new MultipleRowExtractor($crawler, $configuration);
$data = $extractor->extract();
print_r($data);

This prints the following:

Array
(
    [0] => Array
        (
            [Name] => Bitcoin
            [Market Cap] => $42,495,710,233
            [% Change] => -1.09
            [hash] => 76faae07da1d2f8c1209d86301d198b3
        )

    [1] => Array
        (
            [Name] => Ethereum
            [Market Cap] => $28,063,517,955
            [% Change] => -8.10
            [hash] => 18ade4435c69b5116acf0909e174b497
        )

    [2] => Array
        (
            [Name] => Ripple
            [Market Cap] => $11,483,663,781
            [% Change] => -2.73
            [hash] => 5bf61e4bb969c04d00944536e02d1e70
        )

    [3] => Array
        (
            [Name] => Litecoin
            [Market Cap] => $2,263,545,508
            [% Change] => -3.36
            [hash] => ea205770c30ddc9cbf267aa5c003933e
        )
   and so on ...
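
The RegexField above keeps the part of the cell text that comes before the % sign. For readers who do not want the library, here is a rough plain-PHP sketch of the same idea. It assumes the library returns the first capture group (the printed output suggests this, but it is not confirmed here), and extractPercent is a hypothetical helper name, not part of the library.

```php
<?php
// Hypothetical helper: apply the same pattern as the RegexField above
// and return the first capture group, or null when there is no match.
function extractPercent(string $cellText): ?string
{
    if (preg_match('/(.*)%/', trim($cellText), $matches) === 1) {
        return trim($matches[1]);
    }
    return null;
}

echo extractPercent(" -8.10% "), "\n"; // -8.10
```

The value comes back as a string; cast it with (float) if you need to do arithmetic on it.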

I hope that helps you :)

Disclaimer: I am the author of this library.



Answer 2:

If you only care about the change percentage, remove the whole foreach section and try this instead:

$query = "//tr[@id='id-ethereum']/td[contains(@class, 'percent-24h')]";
$entries = $xpath->query($query);

echo $entries->item(0)->getAttribute('data-usd'); //-5.15
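
To try the attribute approach without hitting the live site, here is a small self-contained sketch. The inline HTML is an assumption modeled on the selectors used in this answer, not a copy of the real page, and the guard around item(0) is an addition so that a missing row does not cause a fatal error.

```php
<?php
// Inline markup standing in for the live page; its structure is an
// assumption based on the XPath selectors in this answer.
$html = <<<HTML
<table>
  <tr id="id-ethereum">
    <td class="no-wrap percent-24h negative_change" data-usd="-5.15">-5.15%</td>
  </tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$cells = $xpath->query("//tr[@id='id-ethereum']/td[contains(@class, 'percent-24h')]");

// item(0) is null when nothing matches, so check before calling getAttribute().
$cell = $cells->item(0);
$change = ($cell instanceof DOMElement) ? $cell->getAttribute('data-usd') : null;

echo $change ?? 'not found', "\n"; // -5.15
```

Reading the attribute avoids string splitting entirely, because the attribute holds only the number while the cell text also carries the % sign.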

Here are the rest of the columns:

$xpath = new DOMXPath($doc);

$market_cap = $xpath->query("//tr[@id='id-ethereum']/td[contains(@class, 'market-cap')]");
echo $market_cap->item(0)->getAttribute('data-usd'); //30574084827.1


$price = $xpath->query("//tr[@id='id-ethereum']/td/a[contains(@class, 'price')]");
echo $price->item(0)->getAttribute('data-usd'); //329.567

$circulating_supply = $xpath->query("//tr[@id='id-ethereum']/td/a[@data-supply]");
echo $circulating_supply->item(0)->getAttribute('data-supply'); //92770467.9991


$volume = $xpath->query("//tr[@id='id-ethereum']/td/a[contains(@class, 'volume')]");
echo $volume->item(0)->getAttribute('data-usd'); //810454000.0


$percent_change = $xpath->query("//tr[@id='id-ethereum']/td[contains(@class, 'percent-24h')]");
echo $percent_change->item(0)->getAttribute('data-usd'); //-3.79
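
When a page has no attribute that carries the clean number, as in the question's "-34.89dowjonesstockchange" case, one option is to pull the leading number out of the text with a regular expression. This is a minimal sketch, not the only way to do it. leadingNumber is a hypothetical helper name, and the pattern assumes a plain signed decimal with no thousands separators.

```php
<?php
// Hypothetical helper: return the signed decimal number at the start of
// $text as a float, or null when the text does not begin with a number.
function leadingNumber(string $text): ?float
{
    if (preg_match('/^\s*([-+]?\d+(?:\.\d+)?)/', $text, $matches) === 1) {
        return (float) $matches[1];
    }
    return null;
}

var_dump(leadingNumber('-34.89dowjonesstockchange')); // float(-34.89)
var_dump(leadingNumber('30 balls'));                  // float(30)
var_dump(leadingNumber('no number here'));            // NULL
```

Unlike taking a fixed number of characters, this keeps working when the value changes length, for example from -34.89 to -7.5 or 120.03.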