Using Nokogiri's CSS method to get all elements

Posted 2019-09-09 21:39

Question:

I am trying to use Nokogiri's CSS method to get some names from my HTML.

This is an example of the HTML:

<section class="container partner-customer padding-bottom--60">
    <div>
        <div>
            <a id="technologies"></a>
            <h4 class="center-align">The Team</h4>
        </div>
    </div>
    <div class="consultant list-across wrap">
        <div class="engineering">
            <img class="" src="https://v0001.jpg" alt="Person 1"/>
            <p>Person 1<br>Founder, Chairman &amp; CTO</p>
        </div>
        <div class="engineering">
            <img class="" src="https://v0002.png" alt="Person 2"/></a>
            <p>Person 2<br>Founder, VP of Engineering</p>
        </div>
        <div class="product">
            <img class="" src="https://v0003.jpg" alt="Person 3"/></a>
            <p>Person 3<br>Product</p>
        </div>
        <div class="Human Resources &amp; Admin">
            <img class="" src="https://v0004.jpg" alt="Person 4"/></a>
            <p>Person 4<br>People &amp; Places</p>
        </div>
        <div class="alliances">
            <img class="" src="https://v0005.jpg" alt="Person 5"/></a>
            <p>Person 5<br>VP of Alliances</p>
        </div>

What I have so far in my people.rake file is the following:

  require 'open-uri' # provides URI.open; plain Kernel#open no longer fetches URLs in Ruby 3+
  staff_site = Nokogiri::HTML(URI.open("https://www.website.com/company/team-all"))
  all_hands = staff_site.css("div.consultant").map(&:text).map(&:squish)

I am having a little trouble getting the value of each alt attribute (the name of the person), as the img elements are nested under a few divs.

Currently, using div.consultant, I get all the names plus the roles, e.g. "Person 1Founder, Chairman & CTO", instead of just the person's name from the alt attribute.

How can I get just the value of the alt attribute?

Answer 1:

Your desired output isn't clear and the HTML is broken.

Start with this:

require 'nokogiri'

doc = Nokogiri::HTML('<html><body><div class="consultant"><img alt="foo"/><img alt="bar" /></div></body></html>')
doc.search('div.consultant img').map{ |img| img['alt'] } # => ["foo", "bar"]

Calling text on the output of css isn't a good idea. css returns a NodeSet, and text on a NodeSet concatenates the text of every node it contains. The result is often mangled, and you then have to figure out how to pull it apart again, which, in the end, is horrible code:

doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>')
doc.search('p').text # => "foobar"

This behavior is documented in NodeSet#text:

Get the inner text of all contained Node objects

Instead, call text (AKA inner_text or content) on the individual nodes. Each call returns the exact text for that node, which you can then join as you want:

Returns the content for this Node

doc.search('p').map(&:text) # => ["foo", "bar"]

See "How to avoid joining all text from Nodes when scraping" also.