Converting remote relative paths to absolute paths

Posted 2019-07-26 08:13

I looked for a similar question but could not find one.

I'm looking for a push in the right direction. I'm currently gathering a list of all href values from a remote site. Since some of them can be relative paths, I need a function that builds an absolute path from each one.

I already have the base URL (the last URL that cURL followed):

$base_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

Now let's say the value of $base_url is http://www.example.com/home/index.html and the href value I'm currently reading is /styles/ie.css

I need to turn that into http://www.example.com/styles/ie.css, and the function should be as general as possible. Here are the possible scenarios (probably not all of them):

1 = base url
2 = relative path
------------------------------------------------
1 http://www.example.com/
2 java/popups.js

1 + 2 = http://www.example.com/java/popups.js
------------------------------------------------
1 http://www.example.com
2 java/popups.js

1 + / + 2 = http://www.example.com/java/popups.js
------------------------------------------------
1 http://www.example.com/mysite/
2 ../java/popups.js 

1 - / + (2 - ..) = http://www.example.com/java/popups.js
------------------------------------------------

1 http://www.example.com/rsc/css/intlhplib-min.css
2 ../images/sunflower.png

1 - /css/intlhplib-min.css + (2 - ..) = http://www.example.com/rsc/images/sunflower.png     

Tags: php html parsing
2 answers
一夜七次
#2 · 2019-07-26 08:17

I think you need to normalize the href with a regular expression first, so that it has a consistent form. You can then get an accurate base URL from parse_url():

<?php
$href = '../images/sunflower.png';
$href = preg_replace('~^\.{0,2}\/~', '', $href);
?>

This strips a single leading /, ./ or ../ from the beginning of the string. Then prepend the base URL:

<?php
$url = 'http://www.example.com/home/index.html';
$url = parse_url($url);

$abspath = $url['scheme'] . '://' . $url['host'] . '/' . $href;

echo $abspath;
?>

This should output what you want: http://www.example.com/images/sunflower.png

UPDATE

If you want the first directory from the base url, then use explode on the parsed url's path key:

$first_directory = '';
if (isset($url['path'])) {
    $patharray = explode('/', $url['path']);
    if (count($patharray)>2){
        $first_directory = $patharray[1] . '/'; // reuse the array instead of calling explode() again
    }
}

And add that to the output variable:

$abspath = $url['scheme'] . '://' . $url['host'] . '/' . $first_directory . $href;

Another Update

To find how the href values relate to the base URL, you can search for an occurrence of ../ or / at the beginning of the href value, and then adjust your absolute URL accordingly. This should help you figure out what the scenarios are:

<?php
$href = '../../images/sunflower.png';
// Leading run of "../", "./" or "/" (empty string when the href has no such prefix)
$prefix = preg_match('~^(\.{0,2}\/)+~', $href, $matches) ? $matches[0] : '';
if (substr_count($prefix, '../')){ // substr_count to count number of '../'
    echo 'Go up ' . substr_count($prefix, '../') . ' directories';
}
else if (substr($prefix, 0, 1) === '/'){ // a leading slash means root-relative
    echo 'Root directory';
}
else {
    echo 'Current directory';
}
?>

Check the demo on IDEONE.
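
The "adjust your absolute URL accordingly" step could look roughly like the sketch below. It is only an illustration: base_dir_for is a hypothetical helper name, and it assumes the base path comes from parse_url() and starts with a slash. It walks up one directory of the base path for each leading ../ in the href.

```php
<?php
// Hypothetical helper (not part of the answer above): returns the directory
// of $base_path after walking up once per leading "../" in $href.
// Assumes $base_path starts with "/", as parse_url()['path'] does.
function base_dir_for(string $base_path, string $href): string
{
    // Directory part of the base path, e.g. "/rsc/css/x.css" -> "/rsc/css"
    $dir = rtrim(substr($base_path, 0, strrpos($base_path, '/') + 1), '/');

    // Count the "../" sequences at the very start of the href.
    preg_match('~^(\.\./)*~', $href, $m);
    $levels = substr_count($m[0], '../');

    // Drop one directory per level, stopping at the root.
    for ($i = 0; $i < $levels && $dir !== ''; $i++) {
        $dir = substr($dir, 0, strrpos($dir, '/'));
    }

    return $dir . '/';
}

echo base_dir_for('/rsc/css/intlhplib-min.css', '../images/sunflower.png'), "\n";
```

The absolute URL is then the scheme and host, followed by this directory, followed by the href with its leading ../ sequences removed.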

爷的心禁止访问
#3 · 2019-07-26 08:35

I ended up writing my own function, after a push in the right direction from @bozdoz.

The function takes two arguments: the first is $resource, the relative file path, and the second is the base URL (used to construct the absolute URL).

This was designed for my own project, so I'm not sure it will fit everyone looking for a similar solution. Feel free to use it and to suggest any efficiency improvements.

Updated version, thanks to Tim Cooper:

function rel2abs_v2($resource, $base_url)
{
    $base_url = parse_url($base_url);
    $resource = trim($resource);

    // An http:// or https:// prefix means $resource is already absolute.
    if (preg_match('~^https?://~i', $resource)) {
        return $resource;
    }

    $root = $base_url['scheme'] . '://' . $base_url['host'];
    // parse_url() omits "path" for URLs such as http://www.example.com
    $path = isset($base_url['path']) ? $base_url['path'] : '/';

    // Latest addition: protocol-relative URL ("//host/path") reuses the base scheme.
    if (substr($resource, 0, 2) === '//') {
        return $base_url['scheme'] . ':' . $resource;
    }

    // Root-relative path: append it to scheme and host.
    if (substr($resource, 0, 1) === '/') {
        return $root . $resource;
    }

    // Directories of the base path, without the file name.
    $path_array = explode('/', $path);
    array_pop($path_array);   // drop the file name (or the empty string after a trailing "/")
    array_shift($path_array); // drop the empty string before the leading "/"

    // Each leading "../" removes one directory from the base path.
    while (substr($resource, 0, 3) === '../') {
        array_pop($path_array);
        $resource = substr($resource, 3);
    }

    $dir = $path_array ? implode('/', $path_array) . '/' : '';
    return $root . '/' . $dir . $resource;
}
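
For anyone who needs something more general, here is a rough sketch that also collapses ./ and ../ segments anywhere in the path, in the spirit of the dot-segment removal described in RFC 3986. It is not the function above: resolve_url is a hypothetical name, the base URL is assumed to have a scheme and host, and user info, query strings and fragments in the base URL are ignored.

```php
<?php
// Hypothetical helper (an illustration, not the function from this answer):
// resolve $href against $base, collapsing "." and ".." path segments.
function resolve_url(string $base, string $href): string
{
    // An href with its own scheme is already absolute.
    if (preg_match('~^[a-z][a-z0-9+.\-]*:~i', $href)) {
        return $href;
    }

    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host']
           . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative href: borrow the scheme from the base.
    if (strncmp($href, '//', 2) === 0) {
        return $parts['scheme'] . ':' . $href;
    }

    // Set the query string / fragment aside so its dots and slashes are untouched.
    $cut    = strcspn($href, '?#');
    $suffix = substr($href, $cut);
    $href   = substr($href, 0, $cut);

    $base_path = $parts['path'] ?? '/';
    if ($href === '') {
        $path = $base_path;                       // same document
    } elseif (strncmp($href, '/', 1) === 0) {
        $path = $href;                            // root-relative
    } else {
        // Keep the base directory (everything up to the last "/").
        $path = substr($base_path, 0, strrpos($base_path, '/') + 1) . $href;
    }

    // Collapse "." and ".." segments; ".." never climbs above the root.
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }

    $result = '/' . implode('/', $out);
    // Preserve a trailing slash when the path ended in "/", "/." or "/..".
    if ($out && preg_match('~/\.{0,2}$~', $path)) {
        $result .= '/';
    }

    return $root . $result . $suffix;
}

echo resolve_url('http://www.example.com/rsc/css/intlhplib-min.css', '../images/sunflower.png'), "\n";
```

Run against the four scenarios from the question, this sketch gives the expected URLs, including the base URL without a trailing slash.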