Get link text and href from an HTML doc with Nokogiri

Posted 2019-02-02 21:35

I'm trying to use the Nokogiri gem to extract all the URLs on a page, along with their link text, and store each link text and URL pair in a hash.

<html>
    <body>
        <a href=#foo>Foo</a>
        <a href=#bar>Bar </a>
    </body>
</html>

I would like to return

{"Foo" => "#foo", "Bar" => "#bar"}

Tags: ruby nokogiri
2 answers
做个烂人
Reply #2 · 2019-02-02 22:08

Another way:

h = doc.css('a[href]').each_with_object({}) { |node, acc| acc[node.text.strip] = node['href'] }
# yields {"Foo"=>"#foo", "Bar"=>"#bar"}

And if you're worried that the same text might link to different things, you can collect the hrefs in arrays instead:

h = doc.css('a[href]').each_with_object(Hash.new { |hash, key| hash[key] = [] }) { |node, acc| acc[node.text.strip] << node['href'] }
# yields {"Foo"=>["#foo"], "Bar"=>["#bar"]}
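To see why the array version matters, here is a minimal sketch in plain Ruby. It assumes the text/href pairs have already been pulled out of the document; the `pairs` array and the `#foo2` link are made up for illustration and stand in for what `doc.css('a[href]')` would give you.

```ruby
# Hypothetical [text, href] pairs, standing in for node.text.strip and
# node['href'] taken from doc.css('a[href]'). "Foo" appears twice on purpose.
pairs = [["Foo", "#foo"], ["Bar", "#bar"], ["Foo", "#foo2"]]

# Plain hash: the later "Foo" silently overwrites the earlier one.
flat = pairs.each_with_object({}) { |(text, href), acc| acc[text] = href }

# Array-valued hash: the default block creates an empty array the first time
# a key is seen, so every href for a given text is kept in document order.
grouped = pairs.each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |(text, href), acc|
  acc[text] << href
end

# flat    == {"Foo" => "#foo2", "Bar" => "#bar"}
# grouped == {"Foo" => ["#foo", "#foo2"], "Bar" => ["#bar"]}
p flat
p grouped
```

So with the plain hash you lose "#foo" without any error, while the array version keeps both. Which one you want depends on whether duplicate link text is possible on your pages.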
Root(大扎)
Reply #3 · 2019-02-02 22:23

Here's a one-liner:

Hash[doc.xpath('//a[@href]').map {|link| [link.text.strip, link["href"]]}]

#=> {"Foo"=>"#foo", "Bar"=>"#bar"}

Split up a bit to be arguably more readable:

h = {}
doc.xpath('//a[@href]').each do |link|
  h[link.text.strip] = link['href']
end
puts h

#=> {"Foo"=>"#foo", "Bar"=>"#bar"}