Getting final urls of shortened urls (like bit.ly)

Posted 2020-02-01 12:37

Question:

[Updated At Bottom]
Hi everyone.

Start With Short URLs:
Imagine that you've got a collection of 5 short urls (like http://bit.ly) in a php array, like this:

$shortUrlArray = array("http://bit.ly/123",
"http://bit.ly/123",
"http://bit.ly/123",
"http://bit.ly/123",
"http://bit.ly/123");

End with Final, Redirected URLs:
How can I get the final url of these short urls with php? Like this:

http://www.example.com/some-directory/some-page.html
http://www.example.com/some-directory/some-page.html
http://www.example.com/some-directory/some-page.html
http://www.example.com/some-directory/some-page.html
http://www.example.com/some-directory/some-page.html

I have one method (found online) that works well with a single url, but when looping over multiple urls, it only works with the final url in the array. For your reference, the method is this:

function get_web_page( $url ) 
{ 
    $options = array( 
        CURLOPT_RETURNTRANSFER => true,     // return web page 
        CURLOPT_HEADER         => true,    // return headers 
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects 
        CURLOPT_ENCODING       => "",       // handle all encodings 
        CURLOPT_USERAGENT      => "spider", // who am i 
        CURLOPT_AUTOREFERER    => true,     // set referer on redirect 
        CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect 
        CURLOPT_TIMEOUT        => 120,      // timeout on response 
        CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects 
    ); 

    $ch      = curl_init( $url ); 
    curl_setopt_array( $ch, $options ); 
    $content = curl_exec( $ch ); 
    $err     = curl_errno( $ch ); 
    $errmsg  = curl_error( $ch ); 
    $header  = curl_getinfo( $ch ); 
    curl_close( $ch ); 

    //$header['errno']   = $err; 
    //$header['errmsg']  = $errmsg; 
    //$header['content'] = $content; 
    print($header[0]); 
    return $header; 
}  


//Using the above method in a for loop

$finalURLs = array();

$lineCount = count($shortUrlArray);

for($i = 0; $i <= $lineCount; $i++){

    $singleShortURL = $shortUrlArray[$i];

    $myUrlInfo = get_web_page( $singleShortURL ); 

    $rawURL = $myUrlInfo["url"];

    array_push($finalURLs, $rawURL);

}

Close, but not enough
This method works, but only with a single url. I can't use it in a for loop, which is what I want to do. When used in a for loop as in the above example, the first four elements come back unchanged, and only the final element is converted into its final url. This happens whether the array is 5 elements or 500 elements long.

Solution Sought:
Please give me a hint as to how you'd modify this method to work when used inside of a for loop with a collection of urls (rather than just one).

-OR-

If you know of code that is better suited for this task, please include it in your answer.

Thanks in advance.

Update:
After some further prodding I've found that the problem lies not in the above method (which, after all, seems to work fine in for loops) but possibly in the encoding. When I hard-code an array of short urls, the loop works fine. But when I pass in a block of newline-separated urls from an html form using GET or POST, the above-mentioned problem ensues. Are the urls somehow being changed into a format not compatible with the method when I submit the form?

New Update:
You guys, I've found that my problem was due to something unrelated to the above method. The URL encoding of my short urls converted what I thought were just newline characters (separating the urls) into %0D%0A, which is a carriage return followed by a line feed. As a result, every short url except the final one in the collection had a "ghost" character appended to its tail, which made it impossible to retrieve the final urls for those. I identified the ghost character, corrected my php explode, and all works fine now. Sorry and thanks.
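
For anyone landing here with the same symptom, here is a minimal sketch of one way to split form input so that no stray character stays attached to the urls. It assumes the ghost character was a leftover carriage return, which the update does not state outright. The helper name split_url_lines is made up for illustration and is not the poster's actual fix.

```php
<?php
// Hypothetical helper: split a textarea-style string into clean urls.
function split_url_lines(string $raw): array
{
    // Split on "\r\n", "\r" or "\n" so Windows and Unix line endings both work.
    $lines = preg_split('/\r\n|\r|\n/', $raw);
    // trim() drops stray surrounding whitespace, array_filter() with strlen
    // removes blank lines, array_values() reindexes the result from 0.
    return array_values(array_filter(array_map('trim', $lines), 'strlen'));
}

// A form typically submits line breaks as "\r\n" (%0D%0A when URL-encoded).
$raw = "http://bit.ly/123\r\nhttp://bit.ly/123\r\nhttp://bit.ly/123";

var_dump(explode("\n", $raw)[0] === "http://bit.ly/123");   // false: "\r" is still attached
var_dump(split_url_lines($raw)[0] === "http://bit.ly/123"); // true
```

Splitting on "\n" alone leaves "\r" on every line except the last, which matches the behaviour described above where only the final url resolved.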

Answer 1:

This may be of some help: How to put string in array, split by new line?

You would probably do something like this, assuming you're getting the URLs returned in POST:

$final_urls = array();

// You can replace chr(10) with "\n" or "\r\n", depending on how you get your urls.
// And of course, change $_POST['short_urls'] to the source of your string.
$short_urls = explode( chr(10), $_POST['short_urls'] );

foreach ( $short_urls as $short ) {
    $final_urls[] = get_web_page( $short );
}

I get the following output, using var_dump($final_urls); and your bit.ly url:

http://codepad.org/8YhqlCo1

And my source: $_POST['short_urls'] = "http://bit.ly/123\nhttp://bit.ly/123\nhttp://bit.ly/123\nhttp://bit.ly/123";

I also got an error, using your function:

Notice: Undefined offset: 0 in /var/www/test.php on line 27

Line 27 is print($header[0]); and I'm not sure what you wanted there...
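
As a rough illustration of why that notice shows up: curl_getinfo() called without a second argument returns an associative array with string keys such as "url" and "http_code", so there is no element at index 0. The sketch below uses a hand-written stand-in array rather than real curl output, and final_url_from_info is a hypothetical name.

```php
<?php
// Hypothetical helper: read the final url out of a curl_getinfo() result.
function final_url_from_info(array $info): ?string
{
    // The info array is keyed by names, so look up "url" instead of index 0.
    return isset($info['url']) && $info['url'] !== '' ? $info['url'] : null;
}

// Hand-written stand-in for a curl_getinfo() result (not real output).
$header = array(
    'url'       => 'http://www.example.com/some-directory/some-page.html',
    'http_code' => 200,
);

var_dump(isset($header[0]));  // false, hence the "Undefined offset: 0" notice
echo final_url_from_info($header), "\n";
```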

Here's my test.php, if it will help: http://codepad.org/zI2wAOWL



Answer 2:

I think you almost have it. Try this:

$shortUrlArray = array("http://yhoo.it/2deaFR",
    "http://bit.ly/900913",
    "http://bit.ly/4m1AUx");

$finalURLs = array();

$lineCount = count($shortUrlArray);

// Note "<" rather than "<=": valid indexes run from 0 to $lineCount - 1
for($i = 0; $i < $lineCount; $i++){
    $singleShortURL = $shortUrlArray[$i];
    $myUrlInfo = get_web_page( $singleShortURL );
    $rawURL = $myUrlInfo["url"];
    // echo rather than printf, so a "%" in the url is not read as a format specifier
    echo $rawURL."\n";
    array_push($finalURLs, $rawURL);
}
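
If you'd rather not track the index at all, a foreach loop sidesteps the bounds question, since the loop in the question used "$i <= $lineCount" and so read one element past the end. The sketch below is only an illustration: resolve_all and the stub resolver are made-up names, and the stub stands in for get_web_page() so the example runs without any network access.

```php
<?php
// Hypothetical helper: resolve every short url with the given callable.
// $resolver stands in for a function like get_web_page() from the question.
function resolve_all(array $shortUrls, callable $resolver): array
{
    $finalUrls = array();
    // foreach visits exactly the elements that exist, so there is no index
    // that could run one step past the end of the array.
    foreach ($shortUrls as $shortUrl) {
        $info = $resolver($shortUrl);
        $finalUrls[] = $info['url'];
    }
    return $finalUrls;
}

// Stub resolver for illustration only; it performs no network request.
$stub = function (string $url): array {
    return array('url' => $url . '#resolved');
};

print_r(resolve_all(array('http://bit.ly/900913', 'http://bit.ly/4m1AUx'), $stub));
```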


Answer 3:

I wrote this script to read a plain text file with one shortened url per line and look up the corresponding redirect url for each line:

<?php
// input: textfile with one bitly shortened url per line
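$plain_urls = file_get_contents('in.txt');
// split on any line ending (\r\n, \r or \n) and skip empty lines
$bitly_urls = preg_split('/\r\n|\r|\n/', $plain_urls, -1, PREG_SPLIT_NO_EMPTY);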

// output: where should we write
$w_out = fopen("out.csv", "a+") or die("Unable to open file!");

foreach($bitly_urls as $bitly_url) {
  $c = curl_init($bitly_url);
  curl_setopt($c, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36');
  curl_setopt($c, CURLOPT_FOLLOWLOCATION, 0);
  curl_setopt($c, CURLOPT_HEADER, 1);
  curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($c, CURLOPT_CONNECTTIMEOUT, 20);
  // curl_setopt($c, CURLOPT_PROXY, 'localhost:9150');
  // curl_setopt($c, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
  $r = curl_exec($c);

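  // get the redirect url (with CURLOPT_FOLLOWLOCATION off, this is the
  // target of the first redirect, which may itself redirect again):
  $redirect_url = curl_getinfo($c)['redirect_url'];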

  // write output as csv
  $out = '"'.$bitly_url.'";"'.$redirect_url.'"'."\n";
  fwrite($w_out, $out);
}
fclose($w_out);

Have fun and enjoy! pw
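
If the end of the redirect chain is what you need rather than the first hop, one possible variation is to let curl follow the redirects and then ask for the effective url. This is a sketch under assumptions, not a drop-in replacement: resolve_final_url is a hypothetical name, the timeouts are arbitrary, and it sends a HEAD request via CURLOPT_NOBODY, which some servers may refuse or answer differently from a GET.

```php
<?php
// Hypothetical helper: follow redirects and report where the request ended up.
// Returns null if the request fails.
function resolve_final_url(string $shortUrl, int $maxRedirects = 10): ?string
{
    $ch = curl_init($shortUrl);
    curl_setopt_array($ch, array(
        CURLOPT_NOBODY         => true,          // HEAD request: skip the page body
        CURLOPT_FOLLOWLOCATION => true,          // follow the whole redirect chain
        CURLOPT_MAXREDIRS      => $maxRedirects, // give up after this many hops
        CURLOPT_RETURNTRANSFER => true,          // do not echo the response
        CURLOPT_CONNECTTIMEOUT => 20,
        CURLOPT_TIMEOUT        => 30,
    ));
    $ok = curl_exec($ch) !== false;
    // CURLINFO_EFFECTIVE_URL is the last url curl actually requested.
    return $ok ? curl_getinfo($ch, CURLINFO_EFFECTIVE_URL) : null;
}

// Usage (needs network access and the curl extension):
// echo resolve_final_url('http://bit.ly/123'), "\n";
```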