How can I get the absolute URL when extracting lin

Posted 2019-01-10 14:51

I'm using Nokogiri to extract links from a page but I would like to get the absolute path even though the one on the page is a relative one. How can I accomplish this?

Tags: ruby nokogiri
3 answers
Viruses.
#2 · 2019-01-10 15:40

Nokogiri is unrelated, other than the fact that it gives you the link anchor to begin with. Use Ruby's URI library to manage paths:

absolute_uri = URI.join( page_url, href ).to_s

Seen in action:

require 'uri'

# The URL of the page with the links
page_url = 'http://foo.com/zee/zaw/zoom.html'

# A variety of links to test.
hrefs = %w[
  http://zork.com/             http://zork.com/#id
  http://zork.com/bar          http://zork.com/bar#id
  http://zork.com/bar/         http://zork.com/bar/#id
  http://zork.com/bar/jim.html http://zork.com/bar/jim.html#id
  /bar                         /bar#id
  /bar/                        /bar/#id
  /bar/jim.html                /bar/jim.html#id
  jim.html                     jim.html#id
  ../jim.html                  ../jim.html#id
  ../                          ../#id
  #id
]

hrefs.each do |href|
  root_href = URI.join(page_url,href).to_s
  puts "%-32s -> %s" % [ href, root_href ]
end
#=> http://zork.com/                 -> http://zork.com/
#=> http://zork.com/#id              -> http://zork.com/#id
#=> http://zork.com/bar              -> http://zork.com/bar
#=> http://zork.com/bar#id           -> http://zork.com/bar#id
#=> http://zork.com/bar/             -> http://zork.com/bar/
#=> http://zork.com/bar/#id          -> http://zork.com/bar/#id
#=> http://zork.com/bar/jim.html     -> http://zork.com/bar/jim.html
#=> http://zork.com/bar/jim.html#id  -> http://zork.com/bar/jim.html#id
#=> /bar                             -> http://foo.com/bar
#=> /bar#id                          -> http://foo.com/bar#id
#=> /bar/                            -> http://foo.com/bar/
#=> /bar/#id                         -> http://foo.com/bar/#id
#=> /bar/jim.html                    -> http://foo.com/bar/jim.html
#=> /bar/jim.html#id                 -> http://foo.com/bar/jim.html#id
#=> jim.html                         -> http://foo.com/zee/zaw/jim.html
#=> jim.html#id                      -> http://foo.com/zee/zaw/jim.html#id
#=> ../jim.html                      -> http://foo.com/zee/jim.html
#=> ../jim.html#id                   -> http://foo.com/zee/jim.html#id
#=> ../                              -> http://foo.com/zee/
#=> ../#id                           -> http://foo.com/zee/#id
#=> #id                              -> http://foo.com/zee/zaw/zoom.html#id

The more convoluted answer here previously used URI.parse(root).merge(URI.parse(href)).to_s.
Thanks to @pguardiario for the improvement.
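
On a real page some hrefs may be missing or may not parse as URIs at all (a raw space is enough to make URI.join raise URI::InvalidURIError). Here is a minimal sketch of a wrapper that skips those. The helper name absolutize_links and the sample hrefs are made up for illustration; with Nokogiri the hrefs would typically come from something like doc.css('a[href]').map { |a| a['href'] }.

```ruby
require 'uri'

# Hypothetical helper (not part of URI or Nokogiri): resolve each href
# against page_url, dropping hrefs that are nil, blank, or unparseable.
def absolutize_links(page_url, hrefs)
  resolved = hrefs.map do |href|
    next if href.nil? || href.strip.empty?
    begin
      URI.join(page_url, href.strip).to_s
    rescue URI::InvalidURIError
      nil # e.g. an href containing a raw space; skip it
    end
  end
  resolved.compact
end

page_url = 'http://foo.com/zee/zaw/zoom.html'
hrefs    = ['/bar', 'jim.html', nil, 'bad link.html', '../#id']

puts absolutize_links(page_url, hrefs)
```

Whether to skip bad hrefs, log them, or try to escape them first depends on what you are scraping; skipping is just the simplest choice for a sketch.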

爱情/是我丢掉的垃圾
#3 · 2019-01-10 15:45

Phrogz' answer is fine but more simply:

URI.join(base, url).to_s
我命由我不由天
#4 · 2019-01-10 15:45

You need to check whether each URL is absolute or relative, for example by checking whether it begins with http:. If the URL is relative, you need to add the host to it. Nokogiri can't do that for you; you need to process every URL on the page yourself to make it absolute.
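
If you do want an explicit absolute-or-relative check, one option is to let Ruby's URI library make it instead of matching on the string "http:". This is only a sketch: the method name to_absolute is made up, and it resolves relative URLs against the full page URL rather than just the host, so that links such as ../jim.html still come out right.

```ruby
require 'uri'

# Hypothetical helper: return href unchanged if it already has a scheme,
# otherwise resolve it against the URL of the page it was found on.
def to_absolute(page_url, href)
  uri = URI.parse(href)
  return uri.to_s if uri.absolute? # true when the URI has a scheme

  URI.join(page_url, href).to_s
end

page_url = 'http://foo.com/zee/zaw/zoom.html'

puts to_absolute(page_url, 'http://zork.com/bar')
puts to_absolute(page_url, '../jim.html')
```

In practice URI.join already handles both cases by itself, so the explicit check mainly helps when absolute and relative links need different treatment.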
