I am trying to use Nokogiri's CSS method to get some names from my HTML.
This is an example of the HTML:
<section class="container partner-customer padding-bottom--60">
<div>
<div>
<a id="technologies"></a>
<h4 class="center-align">The Team</h4>
</div>
</div>
<div class="consultant list-across wrap">
<div class="engineering">
<img class="" src="https://v0001.jpg" alt="Person 1"/>
<p>Person 1<br>Founder, Chairman & CTO</p>
</div>
<div class="engineering">
<img class="" src="https://v0002.png" alt="Person 2"/></a>
<p>Person 2<br>Founder, VP of Engineering</p>
</div>
<div class="product">
<img class="" src="https://v0003.jpg" alt="Person 3"/></a>
<p>Person 3<br>Product</p>
</div>
<div class="Human Resources & Admin">
<img class="" src="https://v0004.jpg" alt="Person 4"/></a>
<p>Person 4<br>People & Places</p>
</div>
<div class="alliances">
<img class="" src="https://v0005.jpg" alt="Person 5"/></a>
<p>Person 5<br>VP of Alliances</p>
</div>
What I have so far in my people.rake file is the following:
staff_site = Nokogiri::HTML(open("https://www.website.com/company/team-all"))
all_hands = staff_site.css("div.consultant").map(&:text).map(&:squish)
I am having a little trouble getting the value of each alt attribute (the name of the person), as the img tags are nested under a few divs.
Currently, using div.consultant, it gets all the names plus the roles, i.e. Person 1Founder, Chairman & CTO, instead of just the person's name in alt.
How could I simply get the value of alt?
Your desired output isn't clear and the HTML is broken.
Start with this:
Calling text on the output of css isn't a good idea. css returns a NodeSet, and text against a NodeSet results in all text being concatenated. That often results in mangled text content, forcing you to figure out how to pull it apart again, which, in the end, is horrible code. This behavior is documented in NodeSet#text.
Instead, use text (AKA inner_text or content) against the individual nodes, resulting in the exact text for each node, which you can then join as you want. See "How to avoid joining all text from Nodes when scraping" also.