How to get final URL after following HTTP redirect

2019-01-06 13:55发布

What I'd like to do is find out what is the last/final URL after following the redirections.

I would prefer not to use cURL. I would like to stick with pure PHP (stream wrappers).

Right now I have a URL (let's say http://domain.test), and I use get_headers() to get specific headers from that page. get_headers will also return multiple Location: headers (see Edit below). Is there a way to use those headers to build the final URL? or is there a PHP function that would automatically do this?

Edit: get_headers() follows redirections and returns all the headers for each response/redirections, so I have all the Location: headers.

4条回答
【Aperson】
2楼-- · 2019-01-06 14:39
/**
 * get_redirect_url()
 * Gets the address that the provided URL redirects to,
 * or FALSE if there's no redirect. 
 *
 * @param string $url
 * @return string
 */
function get_redirect_url($url){
    $redirect_url = null; 

    $url_parts = @parse_url($url);
    if (!$url_parts) return false;
    if (!isset($url_parts['host'])) return false; //can't process relative URLs
    if (!isset($url_parts['path'])) $url_parts['path'] = '/';

    $sock = fsockopen($url_parts['host'], (isset($url_parts['port']) ? (int)$url_parts['port'] : 80), $errno, $errstr, 30);
    if (!$sock) return false;

    $request = "HEAD " . $url_parts['path'] . (isset($url_parts['query']) ? '?'.$url_parts['query'] : '') . " HTTP/1.1\r\n"; 
    $request .= 'Host: ' . $url_parts['host'] . "\r\n"; 
    $request .= "Connection: Close\r\n\r\n"; 
    fwrite($sock, $request);
    $response = '';
    while(!feof($sock)) $response .= fread($sock, 8192);
    fclose($sock);

    if (preg_match('/^Location: (.+?)$/m', $response, $matches)){
        if ( substr($matches[1], 0, 1) == "/" )
            return $url_parts['scheme'] . "://" . $url_parts['host'] . trim($matches[1]);
        else
            return trim($matches[1]);

    } else {
        return false;
    }

}

/**
 * get_all_redirects()
 * Follows and collects all redirects, in order, for the given URL. 
 *
 * @param string $url
 * @return array
 */
function get_all_redirects($url){
    $redirects = array();
    while ($newurl = get_redirect_url($url)){
        if (in_array($newurl, $redirects)){
            break;
        }
        $redirects[] = $newurl;
        $url = $newurl;
    }
    return $redirects;
}

/**
 * get_final_url()
 * Gets the address that the URL ultimately leads to. 
 * Returns $url itself if it isn't a redirect.
 *
 * @param string $url
 * @return string
 */
function get_final_url($url){
    $redirects = get_all_redirects($url);
    if (count($redirects)>0){
        return array_pop($redirects);
    } else {
        return $url;
    }
}

And, as always, give credit:

http://w-shadow.com/blog/2008/07/05/how-to-get-redirect-url-in-php/

查看更多
叛逆
3楼-- · 2019-01-06 14:39

xaav answer is very good; except for the following two issues:

  • It does not support HTTPS protocol => The solution was proposed as a comment in the original site: http://w-shadow.com/blog/2008/07/05/how-to-get-redirect-url-in-php/
  • Some sites will not work since they will not recognise the underlying user agent (client browser) => This is simply fixed by adding a User-agent header field: I added an Android user agent (you can find here http://www.useragentstring.com/pages/useragentstring.php other user agent examples according you your need):

    $request .= "User-Agent: Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30\r\n";

Here's the modified answer:

/**
 * get_redirect_url()
 * Gets the address that the provided URL redirects to,
 * or FALSE if there's no redirect. 
 *
 * @param string $url
 * @return string
 */
function get_redirect_url($url){
    $redirect_url = null; 

    $url_parts = @parse_url($url);
    if (!$url_parts) return false;
    if (!isset($url_parts['host'])) return false; //can't process relative URLs
    if (!isset($url_parts['path'])) $url_parts['path'] = '/';

    $sock = fsockopen($url_parts['host'], (isset($url_parts['port']) ? (int)$url_parts['port'] : 80), $errno, $errstr, 30);
    if (!$sock) return false;

    $request = "HEAD " . $url_parts['path'] . (isset($url_parts['query']) ? '?'.$url_parts['query'] : '') . " HTTP/1.1\r\n"; 
    $request .= 'Host: ' . $url_parts['host'] . "\r\n"; 
    $request .= "User-Agent: Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30\r\n";
    $request .= "Connection: Close\r\n\r\n"; 
    fwrite($sock, $request);
    $response = '';
    while(!feof($sock)) $response .= fread($sock, 8192);
    fclose($sock);

    if (preg_match('/^Location: (.+?)$/m', $response, $matches)){
        if ( substr($matches[1], 0, 1) == "/" )
            return $url_parts['scheme'] . "://" . $url_parts['host'] . trim($matches[1]);
        else
            return trim($matches[1]);

    } else {
        return false;
    }

}

/**
 * get_all_redirects()
 * Follows and collects all redirects, in order, for the given URL. 
 *
 * @param string $url
 * @return array
 */
function get_all_redirects($url){
    $redirects = array();
    while ($newurl = get_redirect_url($url)){
        if (in_array($newurl, $redirects)){
            break;
        }
        $redirects[] = $newurl;
        $url = $newurl;
    }
    return $redirects;
}

/**
 * get_final_url()
 * Gets the address that the URL ultimately leads to. 
 * Returns $url itself if it isn't a redirect.
 *
 * @param string $url
 * @return string
 */
function get_final_url($url){
    $redirects = get_all_redirects($url);
    if (count($redirects)>0){
        return array_pop($redirects);
    } else {
        return $url;
}
查看更多
forever°为你锁心
4楼-- · 2019-01-06 14:40
function getRedirectUrl ($url) {
    stream_context_set_default(array(
        'http' => array(
            'method' => 'HEAD'
        )
    ));
    $headers = get_headers($url, 1);
    if ($headers !== false && isset($headers['Location'])) {
        return $headers['Location'];
    }
    return false;
}

Additionally...

As was mentioned in a comment, the final item in $headers['Location'] will be your final URL after all redirects. It's important to note, though, that it won't always be an array. Sometimes it's just a run-of-the-mill, non-array variable. In this case, trying to access the last array element will most likely return a single character. Not ideal.

If you are only interested in the final URL, after all the redirects, I would suggest changing

return $headers['Location'];

to

return is_array($headers['Location']) ? array_pop($headers['Location']) : $headers['Location'];

... which is just if short-hand for

if(is_array($headers['Location'])){
     return array_pop($headers['Location']);
}else{
     return $headers['Location'];
}

This fix will take care of either case (array, non-array), and remove the need to weed-out the final URL after calling the function.

In the case where there are no redirects, the function will return false. Similarly, the function will also return false for invalid URLs (invalid for any reason). Therefor, it is important to check the URL for validity before running this function, or else incorporate the redirect check somewhere into your validation.

查看更多
Rolldiameter
5楼-- · 2019-01-06 14:42

While the OP wanted to avoid cURL, it's best to use it when it's available. Here's a solution which has the following advantages

  • uses curl for all the heavy lifting, so works with https
  • copes with servers which return lower cased location header name (both xaav and webjay's answers do not handle this)
  • allows you to control how deep you want you go before giving up

Here's the function:

function findUltimateDestination($url, $maxRequests = 10)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, $maxRequests);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);

    //customize user agent if you desire...
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Link Checker)');

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_exec($ch);

    $url=curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

    curl_close ($ch);
    return $url;
}

Here's a more verbose version which allows you to inspect the redirection chain rather than let curl follow it.

function findUltimateDestination($url, $maxRequests = 10)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);

    //customize user agent if you desire...
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Link Checker)');

    while ($maxRequests--) {

        //fetch
        curl_setopt($ch, CURLOPT_URL, $url);
        $response = curl_exec($ch);

        //try to determine redirection url
        $location = '';
        if (in_array(curl_getinfo($ch, CURLINFO_HTTP_CODE), [301, 302, 303, 307, 308])) {
            if (preg_match('/Location:(.*)/i', $response, $match)) {
                $location = trim($match[1]);
            }
        }

        if (empty($location)) {
            //we've reached the end of the chain...
            return $url;
        }

        //build next url
        if ($location[0] == '/') {
            $u = parse_url($url);
            $url = $u['scheme'] . '://' . $u['host'];
            if (isset($u['port'])) {
                $url .= ':' . $u['port'];
            }
            $url .= $location;
        } else {
            $url = $location;
        }
    }

    return null;
}

As an example of redirection chain which this function handles, but the others do not, try this:

echo findUltimateDestination('http://dx.doi.org/10.1016/j.infsof.2016.05.005')

At the time of writing, this involves 4 requests, with a mixture of Location and location headers involved.

查看更多
登录 后发表回答