I'm building a web crawler and I'm trying to find a way to turn a relative path into an absolute one.
I took two test sites: one built with ROR and one built with Pyro CMS.
In the latter, I found href attributes with the link "index.php". So if I'm currently crawling http://example.com/xyz, my crawler appends the link and produces http://example.com/xyz/index.php. The problem is that I should be appending to the root instead, i.e. the result should have been http://example.com/index.php. And if I then crawl http://example.com/xyz/index.php, I'll find another "index.php", which gets appended again.
On the ROR site, by contrast, the relative paths start with '/', so I can easily tell that they are relative to the site root.
I can handle the case of index.php, but there might be so many rules that I need to take care of if I start doing it manually. I'm sure there's an easier way to get this done.
In Go, package path is your friend.
You can get the directory or folder from a path with path.Dir(), e.g.:
p := "/xyz/index.php"
dir := path.Dir(p)
fmt.Println("dir:", dir) // Output: dir: /xyz
If you find a link with a rooted path (one that starts with a slash), you can use it as-is.
If it is relative, you can join it with the dir above using path.Join(). Join() will also "clean" the path:
p2 := path.Join(dir, "index.php")
fmt.Println("p2:", p2)
p3 := path.Join(dir, "./index.php")
fmt.Println("p3:", p3)
p4 := path.Join(dir, "../index.php")
fmt.Println("p4:", p4)
Output:
p2: /xyz/index.php
p3: /xyz/index.php
p4: /index.php
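The two cases above (rooted link versus relative link) can be combined into one small function. This is only a sketch: resolvePath is a hypothetical helper name, not something from the standard library, and it works on the path part only, assuming the scheme and host are handled elsewhere.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// resolvePath is a hypothetical helper. current is the path of the page
// being crawled, href is the link found on that page.
func resolvePath(current, href string) string {
	// A rooted link (leading slash) is used as-is.
	if strings.HasPrefix(href, "/") {
		return href
	}
	// A relative link is joined with the directory of the current page.
	return path.Join(path.Dir(current), href)
}

func main() {
	fmt.Println(resolvePath("/xyz/index.php", "index.php"))    // /xyz/index.php
	fmt.Println(resolvePath("/xyz/index.php", "../index.php")) // /index.php
	fmt.Println(resolvePath("/xyz/index.php", "/about"))       // /about
	fmt.Println(resolvePath("/xyz", "index.php"))              // /index.php
}
```

Note the last call: because path.Dir("/xyz") is "/", a link to "index.php" found on /xyz resolves to /index.php, which is the result the question asks for.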
The "cleaning" tasks performed by path.Join() are done by path.Clean(), which you can of course call manually on any path. They are:
- Replace multiple slashes with a single slash.
- Eliminate each . path name element (the current directory).
- Eliminate each inner .. path name element (the parent directory) along with the non-.. element that precedes it.
- Eliminate .. elements that begin a rooted path: that is, replace "/.." by "/" at the beginning of a path.
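To see each of the four rules in action, here is a short sketch with one input per rule. The normalize wrapper is a hypothetical name introduced only for illustration; it does nothing beyond calling path.Clean().

```go
package main

import (
	"fmt"
	"path"
)

// normalize is a hypothetical wrapper around path.Clean, giving the
// crawler a single place where paths are cleaned.
func normalize(p string) string {
	return path.Clean(p)
}

func main() {
	fmt.Println(normalize("/xyz//index.php"))   // multiple slashes: /xyz/index.php
	fmt.Println(normalize("/xyz/./index.php"))  // "." element:      /xyz/index.php
	fmt.Println(normalize("/xyz/../index.php")) // inner "..":       /index.php
	fmt.Println(normalize("/../index.php"))     // leading "/..":    /index.php
}
```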
And if you have a "full" URL (with scheme, host, etc.), you can use the url.Parse() function to obtain a url.URL value from the raw URL string. It tokenizes the URL for you, so you can get the path like this:
uraw := "http://example.com/xyz/index.php"
u, err := url.Parse(uraw)
if err != nil {
	fmt.Println("Invalid url:", err)
}
fmt.Println("Path:", u.Path)
Output:
Path: /xyz/index.php
Try all the examples on the Go Playground.
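As an alternative that goes slightly beyond the snippets above, the net/url package can also do the whole resolution in one step: the ResolveReference() method of url.URL resolves a (possibly relative) reference against a base URL and keeps the scheme and host. The sketch below assumes the crawler knows the URL of the page on which each link was found; absoluteURL is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteURL is a hypothetical helper: it resolves href against the URL
// of the page on which the link was found.
func absoluteURL(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "http://example.com/xyz/index.php"
	for _, href := range []string{"index.php", "../index.php", "/about"} {
		abs, err := absoluteURL(page, href)
		if err != nil {
			fmt.Println("Invalid url:", err)
			continue
		}
		fmt.Println(href, "->", abs)
	}
}
```

With the page URL above, the three links resolve to http://example.com/xyz/index.php, http://example.com/index.php and http://example.com/about respectively. A link that is already absolute is returned unchanged.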