Using the Readability API to scrape the most relevant image

Posted 2019-04-01 17:42

I am using the Readability API to do this. Their example shows a lead_image_url field, but I could not fetch it.

Reference: https://www.readability.com/developers/api/parser

Is this the correct way to make a direct request?

  1. https://www.readability.com//api/content/v1/parser?url=http://blog.readability.com/2011/02/step-up-be-heard-readability-ideas/&token=1b830931777ac7c2ac954e9f0d67df437175e66e

  2. https://www.readability.com/parser/?token=1b830931777ac7c2ac954e9f0d67df437175e66e&url=http://nextbigwhat.com

It says: {"messages": "The API Key in the form of the 'token' parameter is invalid.", "error": true}

Another try:

<?php
    define('TOKEN', "1b830931777ac7c2ac954e9f0d67df437175e66e");    
    define('API_URL', "https://www.readability.com/api/content/v1/parser?url=%s&token=%s");

   function get_image($url) {   

    // sanitize it so we don't break our api url    
    $encodedUrl = urlencode($url);    
    $TOKEN = '1b830931777ac7c2ac954e9f0d67df437175e66e';    
    $API_URL = 'https://www.readability.com/api/content/v1/parser?url=%s&token=%s';    
//  $API_URL = 'http://blog.readability.com/2011/02/step-up-be-heard-readability-ideas';    
    // build our url   
    $url = sprintf($API_URL, $encodedUrl, $TOKEN);    

    // call the api    
    $response = file_get_contents($url);    
    if( $response === false ) {    
        return false;   
    }    
    $json = json_decode($response, true); // true: decode into an associative array    
    if(!isset($json['lead_image_url'])) {    
        return false;    
    }    

    return $json['lead_image_url'];

}

Error: Warning: file_get_contents(https://www.readability.com/api/content/v1/parser?url=http%3A%2F%2Fthenwat.com%2Fthenwat%2Finvite%2Findex.php&amp;token=1b830931777ac7c2ac954e9f0d67df437175e66e): failed to open stream: HTTP request failed! HTTP/1.1 403 FORBIDDEN in F:\wamp\www\inviteold\test2.php on line 32

One more attempt:

require 'readability/lib/Readability.inc.php';
$url = 'http://www.nextbigwhat.com';
$html = file_get_contents($url);

$Readability     = new Readability($html); // default charset is utf-8
$ReadabilityData = $Readability->getContent();

$image= $ReadabilityData['lead_image_url'];
$title= $ReadabilityData['title']; //This works fine.
$content = $ReadabilityData['word_count'];

echo "$content"; 

It says: Notice: Undefined index: lead_image_url in F:\wamp\www\inviteold\test2.php on line 13

Tags: php parsing
1 answer
孤傲高冷的网名
Reply #2 · 2019-04-01 17:48

First, in order to use the REST API that they provide, you need to create an account. Afterwards you can generate your own token to use in the call. The token shown in their examples will not work because it is purposefully invalid; it is there for illustration only.
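As a rough sketch of how the request URL could be assembled once you have your own token: the endpoint below is the one used in the question, while buildParserUrl() and the YOUR_OWN_TOKEN placeholder are hypothetical names introduced only for illustration.

```php
<?php
// Endpoint as used in the question; the token passed below is a placeholder, not a real key.
const PARSER_ENDPOINT = 'https://www.readability.com/api/content/v1/parser';

// Hypothetical helper: builds the parser request URL for a page and a token.
function buildParserUrl($pageUrl, $token)
{
    // http_build_query() URL-encodes both values; the explicit '&' separator
    // avoids depending on the arg_separator.output ini setting
    $query = http_build_query(array('url' => $pageUrl, 'token' => $token), '', '&');
    return PARSER_ENDPOINT . '?' . $query;
}

$requestUrl = buildParserUrl(
    'http://blog.readability.com/2011/02/step-up-be-heard-readability-ideas/',
    'YOUR_OWN_TOKEN'
);
echo $requestUrl, "\n";
```

Letting http_build_query() do the encoding keeps the target URL from breaking the query string, which is what the urlencode() call in the question was aiming for.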

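Second, make sure the allow_url_fopen directive in your php.ini file is enabled (allow_url_fopen = On), since file_get_contents() needs it to open remote URLs. Note that this directive is PHP_INI_SYSTEM, so calling ini_set('allow_url_fopen', true); at the top of your page has no effect; if you cannot change your php.ini file (shared hosting solutions), fetch the page with the cURL extension instead.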
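One possible way to check the setting at runtime and fall back to cURL when it is off is sketched below. Both urlFopenEnabled() and fetchPage() are hypothetical helpers, not part of Readability or of the code in the question.

```php
<?php
// Hypothetical helper: reports whether file_get_contents() is allowed to open URLs.
function urlFopenEnabled()
{
    // ini_get() returns the raw setting as a string such as "1", "On" or ""
    return filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

// Hypothetical helper: fetches a page body, or returns false when that is not possible.
function fetchPage($url)
{
    if (urlFopenEnabled()) {
        return @file_get_contents($url);
    }
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        return curl_exec($ch);
    }
    return false; // neither transport is available
}

var_dump(urlFopenEnabled());
```

With this in place, $html = fetchPage($url); could stand in for the file_get_contents($url) call, provided the cURL extension is installed on the host.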

Lastly, in order to parse the images yourself you'll need to retrieve all image elements from the DOM of the page you fetch. Some pages will contain images and some will not; it depends on what page you're pulling from. Additionally, you'll need to resolve relative paths...

Your Code

require 'readability/lib/Readability.inc.php';
$url = 'http://www.nextbigwhat.com';
$html = file_get_contents($url);

$Readability     = new Readability($html); // default charset is utf-8
$ReadabilityData = $Readability->getContent();

$image= $ReadabilityData['lead_image_url'];
$title= $ReadabilityData['title']; //This works fine.
$content = $ReadabilityData['word_count'];

echo "$content"; 

After executing Readability, you can use the DOMDocument class to retrieve the images from the contents you pulled. Instantiate a new DOMDocument and load in your HTML. Make sure to call libxml_use_internal_errors(true) first to suppress the warnings the parser raises for the invalid markup found on most websites. We'll put this in a function to make it easier to use elsewhere if need be.

function sampleDomMedia($html) {
    // Suppress validator errors
    libxml_use_internal_errors(true);

    // New document
    $dom = new DOMDocument();
    // Populate document
    $dom->loadHTML($html);
    //[...]

You can now retrieve all image elements from the document you instantiated, and then get their src attribute... like so:

    //[...]
    // Get image elements
    $nodeList = $dom->getElementsByTagName('img');

    // Get length
    $length = $nodeList->length;

    // Initialize array
    $images = array();

    // Iterate over our nodes
    for($i=0;$i<$length;$i++) {
        // Get the current node
        $node = $nodeList->item($i);
        // Retrieve the src attribute
        $image = $node->getAttribute('src');

        // Push image src into $images array
        array_push($images,$image);
    }

    return $images;
}
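
To see what this extraction step produces, here is a small self-contained sketch that runs the same DOMDocument calls on an inline HTML string. The sample markup is made up for illustration.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$sampleHtml = '<html><body><p>Intro</p>'
    . '<img src="/img/logo.png"><img src="photo.jpg">'
    . '</body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($sampleHtml);
libxml_clear_errors();            // discard the collected warnings

// Collect the src attribute of every <img> element, in document order
$sources = array();
foreach ($dom->getElementsByTagName('img') as $img) {
    $sources[] = $img->getAttribute('src');
}

print_r($sources); // the two src values, still unresolved: /img/logo.png and photo.jpg
```

The values come back exactly as written in the markup, which is why the relative paths still need resolving.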

Now you have an array of images that you can present to the user for use. But before you do that, we forgot one more thing... We want to resolve all relative paths so that we always have an absolute path to the image that lives on another site.

To do this, we have to determine the base domain URL, and the relative path to the current page we're working with. We can do so using the parse_url() function provided by PHP. For simplicity's sake, we can throw this into a function.

function getUrls($url) {
    // Parse URL
    $urlArr = parse_url($url);

    // Determine Base URL, with scheme, host, and port
    $base = $urlArr['scheme']."://".$urlArr['host'];
    if(array_key_exists("port",$urlArr) && $urlArr['port'] != 80) {
        $base .= ":".$urlArr['port'];
    }

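    // parse_url() omits 'path' for URLs such as http://www.nextbigwhat.com, so default to "/"
    $path = array_key_exists("path",$urlArr) ? $urlArr['path'] : "/";

    // Truncate the Path using the position of the last forward slash
    $relative = $base.substr($path, 0, strrpos($path,"/")+1);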

    // Return our two URLs
    return array($base, $relative);
}
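
For reference, a short sketch of what parse_url() returns for two kinds of URL. The example.com addresses are placeholders. Note that the path key is absent when the URL has no path at all, so it is safest to guard that lookup.

```php
<?php
// A URL with an explicit port, a path and a query string
$full = parse_url('http://example.com:8080/blog/2011/post.html?x=1');
// A bare host: parse_url() returns only 'scheme' and 'host' here
$bare = parse_url('http://example.com');

echo $full['scheme'], ' ', $full['host'], ' ', $full['port'], ' ', $full['path'], "\n";
var_dump(isset($bare['path']));

// Guarded lookup: fall back to "/" when no path is present
$path = isset($bare['path']) ? $bare['path'] : '/';
// Keep everything up to and including the last slash
$directory = substr($path, 0, strrpos($path, '/') + 1);
echo $directory, "\n";
```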

Add an additional parameter to the original sampleDomMedia function, and we can call this function to get our paths. Then we can check the src attribute's value to determine what kind of path it is, and resolve it.

function sampleDomMedia($html, $url) {
    // Retrieve our URLs
    list($baseUrl, $relativeUrl) = getUrls($url);

    libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $nodeList = $dom->getElementsByTagName('img');
    $length = $nodeList->length;
    $images = array();

    for($i=0;$i<$length;$i++) {
        $node = $nodeList->item($i);
        $image = $node->getAttribute('src');

        // Resolve relative paths
        if(substr($image,0,2)=="//") { // Protocol-relative URL (missing scheme)
            $image = "http:".$image;
        } else if(substr($image,0,1)=="/") { // Path relative to the site root
            $image = $baseUrl.$image;
        } else if(substr($image,0,4)!=="http") { // Path relative to the current directory
            $image = $relativeUrl.$image;
        }

        array_push($images,$image);
    }

    return $images;
}
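
The three resolution branches can be tried in isolation with a sketch like the following. resolveSrc() is a hypothetical helper that applies the same checks to a single src value, and the example.com URLs are placeholders.

```php
<?php
// Hypothetical helper applying the same three checks to one src value.
function resolveSrc($src, $baseUrl, $relativeUrl)
{
    if (substr($src, 0, 2) == '//') {          // protocol-relative URL
        return 'http:' . $src;
    } elseif (substr($src, 0, 1) == '/') {     // relative to the site root
        return $baseUrl . $src;
    } elseif (substr($src, 0, 4) !== 'http') { // relative to the current directory
        return $relativeUrl . $src;
    }
    return $src;                               // already absolute
}

$baseUrl     = 'http://example.com';
$relativeUrl = 'http://example.com/blog/';
$samples     = array('//cdn.example.com/a.png', '/img/b.png', 'c.png', 'https://other.example/d.png');

$resolved = array();
foreach ($samples as $src) {
    $resolved[] = resolveSrc($src, $baseUrl, $relativeUrl);
}
print_r($resolved);
```

The order of the checks matters: the "//" test has to come before the "/" test, otherwise protocol-relative URLs would be treated as root-relative paths.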

And last, but certainly not least, we're left with the two previous functions, and this piece of procedural code:

require 'readability/lib/Readability.inc.php';
$url = 'http://www.nextbigwhat.com';
$html = file_get_contents($url);

$Readability     = new Readability($html); // default charset is utf-8
$ReadabilityData = $Readability->getContent();

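$images = sampleDomMedia($html, $url);
// The local library did not return 'lead_image_url' (see the notice in the question),
// so guard the lookup and fall back to the first image found on the page
$image = isset($ReadabilityData['lead_image_url']) ? $ReadabilityData['lead_image_url'] : (isset($images[0]) ? $images[0] : null);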

$title = $ReadabilityData['title']; //This works fine.
$content = $ReadabilityData['word_count'];

echo "$content";

Also, if you think the article body itself may contain an image (it usually doesn't), you can pass the contents returned from Readability instead of the $html variable, like so:

$title = $ReadabilityData['title']; //This works fine.
$content = $ReadabilityData['content']; // Article HTML; assumes the library returns it under 'content'
$images = sampleDomMedia($content, $url);

I hope that helps.
