Using Nokogiri's CSS method to get all elements

I am trying to use Nokogiri's CSS method to get some names from my HTML.

This is an example of the HTML:

<section class="container partner-customer padding-bottom--60">
    <div>
        <div>
            <a id="technologies"></a>
            <h4 class="center-align">The Team</h4>
        </div>
    </div>
    <div class="consultant list-across wrap">
        <div class="engineering">
            <img class="" src="https://v0001.jpg" alt="Person 1"/>
            <p>Person 1<br>Founder, Chairman &amp; CTO</p>
        </div>
        <div class="engineering">
            <img class="" src="https://v0002.png" alt="Person 2"/></a>
            <p>Person 2<br>Founder, VP of Engineering</p>
        </div>
        <div class="product">
            <img class="" src="https://v0003.jpg" alt="Person 3"/></a>
            <p>Person 3<br>Product</p>
        </div>
        <div class="Human Resources &amp; Admin">
            <img class="" src="https://v0004.jpg" alt="Person 4"/></a>
            <p>Person 4<br>People &amp; Places</p>
        </div>
        <div class="alliances">
            <img class="" src="https://v0005.jpg" alt="Person 5"/></a>
            <p>Person 5<br>VP of Alliances</p>
        </div>

What I have so far in my people.rake file is the following:

  require 'open-uri' # provides URI.open for fetching URLs

  staff_site = Nokogiri::HTML(URI.open("https://www.website.com/company/team-all"))
  all_hands = staff_site.css("div.consultant").map(&:text).map(&:squish)

I am having a little trouble getting the value of each alt="" attribute (the name of the person), as the img tags are nested under a few divs.

Currently, using div.consultant, it gets all the names plus the roles, e.g. Person 1Founder, Chairman & CTO, instead of just the person's name in alt=.

How could I get just the value of alt?

Your desired output isn't clear and the HTML is broken.

Start with this:

require 'nokogiri'

doc = Nokogiri::HTML('<html><body><div class="consultant"><img alt="foo"/><img alt="bar" /></div></body></html>')
doc.search('div.consultant img').map{ |img| img['alt'] } # => ["foo", "bar"]

Calling text on the output of css isn't a good idea. css returns a NodeSet, and text called on a NodeSet concatenates the text of every node it contains. The result is often mangled, and you then have to figure out how to pull it apart again, which leads to horrible code:

doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>')
doc.search('p').text # => "foobar"

This behavior is documented in NodeSet#text:

Get the inner text of all contained Node objects

Instead, call text (also known as inner_text or content) on the individual nodes. That gives you the exact text for each node, which you can then join however you want. Node#text is documented as:

Returns the content for this Node

doc.search('p').map(&:text) # => ["foo", "bar"]

See "How to avoid joining all text from Nodes when scraping" also.
