Resolving absolute path from relative path

2019-07-14 06:28发布

问题:

I'm making a web-crawler and I'm trying to figure out a way to find out absolute path from relative path. I took 2 test sites. One in ROR and 1 made using Pyro CMS.

In the latter one, I found href tags with link "index.php". So, If I'm currently crawling at http://example.com/xyz, then my crawler will append and make it http://example.com/xyz/index.php. But the problem is that, I should be appending to root instead i.e. it should have been http://example.com/index.php. So if I crawl http://example.com/xyz/index.php, I'll find another "index.php" which gets appended again.

While in ROR, if the relative path starts with '/', I could've easily known that it is a root site.

I can handle the case of index.php, but there might be so many rules that I need to take care of if I start doing it manually. I'm sure there's an easier way to get this done.

回答1:

In Go, package path is your friend.

You can get the directory or folder from a path with path.Dir(), e.g.

p := "/xyz/index.php"
dir := path.Dir(p)
fmt.Println("dir:", dir) // Output: "/xyz"

If you find a link with root path (starts with a slash), you can use that as-is.

If it is relative, you can join it with the dir above using path.Join(). Join() will also "clean" the url:

p2 := path.Join(dir, "index.php")
fmt.Println("p2:", p2)
p3 := path.Join(dir, "./index.php")
fmt.Println("p3:", p3)
p4 := path.Join(dir, "../index.php")
fmt.Println("p4:", p4)

Output:

p2: /xyz/index.php
p3: /xyz/index.php
p4: /index.php

The "cleaning" tasks performed by path.Join() are done by path.Clean() which you can manually call on any path of course. They are:

  1. Replace multiple slashes with a single slash.
  2. Eliminate each . path name element (the current directory).
  3. Eliminate each inner .. path name element (the parent directory) along with the non-.. element that precedes it.
  4. Eliminate .. elements that begin a rooted path: that is, replace "/.." by "/" at the beginning of a path.

And if you have a "full" url (with schema, host, etc.), you can use the url.Parse() function to obtain a url.URL value from the raw url string which tokenizes the url for you, so you can get the path like this:

uraw := "http://example.com/xyz/index.php"
u, err := url.Parse(uraw)
if err != nil {
    fmt.Println("Invalid url:", err)
}
fmt.Println("Path:", u.Path)

Output:

Path: /xyz/index.php

Try all the examples on the Go Playground.