Nokogiri and XPath: find all text between two tags

I'm not sure if it's a matter of syntax or differences in versions, but I can't seem to figure this out. I want to extract the data inside an (unclosed) td, from the h2 tag up to the h3 tag. Here is what the HTML looks like.

<td valign="top" width="350">
    <br><h2>NameIWant</h2><br>
    <br>Town<br>

    PhoneNumber<br>
    <a href="mailto:emailIwant@nowhere.com" class="links">emailIwant@nowhere.com</a>
    <br>
    <a href="http://websiteIwant.com" class="links">websiteIwant.com</a>
    <br><br>    
    <br><img src="images/spacer.gif"/><br>

    <h3><b>I want to stop before this!</b></h3>
    Lorem Ipsum Yadda Yadda<br>
    <img src="images/spacer.gif" border="0" width="20" height="11" alt=""/><br>
    <td width="25">
        <img src="images/spacer.gif" border="0" width="20" height="8" alt=""/>
        <td valign="top" width="200"><img src="images/spacer.gif"/>
            <br>
            <br>

            <table cellspacing="0" cellpadding="0" border="0"/>205"&gt;<tr><td>
                <a href="http://dontneedthis.com">
                </a></td></tr><br>
            <table border="0" cellpadding="3" cellspacing="0" width="200">
            ...

The <td valign> doesn't close until the very bottom of the page, which I think might be why I'm having problems.

My Ruby code looks like:

require 'open-uri'
require 'nokogiri'

@doc = Nokogiri::XML(open("http://www.url.com"))

content = @doc.css('//td[valign="top"] [width="350"]')

name = content.xpath('//h2').text
puts name # Returns NameIWant

townNumberLinks = content.search('//following::h2')
puts content # Returns <h2> NameIWant </h2>

As I understand it, the following axis "selects everything in the document after the closing tag of the current node". If I try to use preceding instead, like:

townNumberLinks = content.search('//preceding::h3')
# I get: <h3><b>I want to stop before this!</b></h3>

Hope I made it clear what I'm trying to do. Thanks!

Tags: html ruby xpath nokogiri

2 Answers

Deceive 欺骗

#2 · 2019-01-26 16:09

Take all element siblings preceding the first <h3> in the cell, then keep only those that have an <h2> as a preceding sibling. Replace //td with the XPath expression that retrieves exactly this table cell. Note that * matches elements only; use node() instead if you also need the text nodes.

//td/h3[1]/preceding-sibling::*[preceding-sibling::h2]


SAY GOODBYE

#3 · 2019-01-26 16:11

It's not trivial. In the context of the nodes you selected (the td), to get everything between two elements, you need to perform an intersection of these two sets:

Set A: All the nodes preceding the first h3: //h3[1]/preceding::node()
Set B: All the nodes following the first h2: //h2[1]/following::node()

To perform an intersection, you can use the Kaysian method (after Michael Kay, who proposed it). The basic formula is:

A[count(.|B) = count(B)]

Applying it to your sets, as defined above, where A = //h3[1]/preceding::node(), and B = //h2[1]/following::node(), we have:

//h3[1]/preceding::node()[ count( . | //h2[1]/following::node()) = count(//h2[1]/following::node()) ]

which will select all elements and text nodes starting with the first <br> after the </h2> tag, to the whitespace text node after the last <br>, just before the next <h3> tag.

You can easily select just the text nodes between h2 and h3 by replacing node() with text() in the expression. This one will return all text nodes (including whitespace and line breaks) between the two headers:

//h3[1]/preceding::text()[ count( . | //h2[1]/following::text()) = count(//h2[1]/following::text()) ]
