nokogiri + mechanize css selector by text

Posted 2019-07-17 20:42

I am new to Nokogiri and so far am most familiar with CSS selectors. I am trying to parse information from a table; a sample of the table and the code I'm using are below. I'm stuck on the appropriate if statement: it seems to return the whole contents of the table.

Table:

<div class="holder">
  <div class ="row">
   <div class="c1">
     <!-- Content I Don't need -->
   </div>
   <div class="c2">
    <span class="data">
     <!-- Content I Don't Need -->
    </span>
   </div>
 </div>
 ...
 <div class="row">
  <div class="c1">
   SPECIFIC TEXT
  </div>
  <div class="c2">
   <span class="data">
    What I want
   </span>
  </div>
 </div>
</div>

My script (if SPECIFIC TEXT is found in the table, it returns every "div.c2 span.data" value, so I've misunderstood either do loops or if statements):

data = []
page = agent.get(url)
page.search('div.row').each do |row_data|
 if row_data.search('div.c1:contains("/SPECIFIC TEXT/")').text.strip
  temp = row_data.search('div.c2 span.data').text.strip
  data << temp
 end
end

2 Answers
劫难
#2 · 2019-07-17 21:22

I'd do it like this:

require 'nokogiri'

html = <<_
<div class="holder">
  <div class ="row">
   <div class="c1">
     <!-- Content I Don't need -->
   </div>
   <div class="c2">
    <span class="data">
     <!-- Content I Don't Need -->
    </span>
   </div>
 </div>
 <div class="row">
  <div class="c1">
   SPECIFIC TEXT
  </div>
  <div class="c2">
   <span class="data">
    What I want
   </span>
  </div>
 </div>
</div>
_

doc = Nokogiri::HTML(html)
css_string = 'div.row > div.c1[text()*="SPECIFIC TEXT"] + div.c2 span.data'
doc.at(css_string).text.strip
# => "What I want"

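How those selectors work here: ">" matches a direct child, "+" matches the element immediately following its sibling, and [text()*="..."] is a non-standard selector that Nokogiri accepts for matching elements whose text contains the given string.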

叼着烟拽天下
#3 · 2019-07-17 21:41

There's no need to stop and insert Ruby logic when you can extract what you need with a single CSS selector.

data = page.search('div.row > div.c1:contains("SPECIFIC TEXT") + div.c2 span.data')

This will include only the elements that match the selector (i.e. those that follow SPECIFIC TEXT).

Here's where your logic may have gone wrong:

This code

if (row_data.search('div.c1:contains("SPECIFIC TEXT")'...
  temp = row_data.search('div.c2 span.data')...

first searches the row for the specific text, then runs the second query from the same starting point (the row). Note that the condition is the String returned by .text.strip, and in Ruby every String is truthy, even an empty one, so the branch runs for every row and every row's "div.c2 span.data" is collected. The key is the + in the CSS selector above, which returns the element immediately following (i.e. the next sibling element). I'm making an assumption, of course, that the next element is always what you want.
