When I scrape several related nodes from HTML or XML to extract the text, all the text is joined into one long string, making it impossible to recover the individual text strings.
For instance:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
<p>baz</p>
</body>
</html>
EOT
doc.search('p').text # => "foobarbaz"
But what I want is:
[\"foo\", \"bar\", \"baz\"]
The same happens when scraping XML:
doc = Nokogiri::XML(<<EOT)
<root>
<block>
<entries>foo</entries>
<entries>bar</entries>
<entries>baz</entries>
</block>
</root>
EOT
doc.search('entries').text # => "foobarbaz"
Why does this happen and how do I avoid it?
This is an easily solved problem that results from not reading the documentation about how text behaves when used on a NodeSet versus a Node (or Element).
The NodeSet documentation says text will:
Get the inner text of all contained Node objects
Which is what we're seeing happen with:
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
<p>baz</p>
</body>
</html>
EOT
doc.search('p').text # => "foobarbaz"
because:
doc.search('p').class # => Nokogiri::XML::NodeSet
Instead, we want to get each Node and extract its text:
doc.search('p').first.class # => Nokogiri::XML::Element
doc.search('p').first.text # => "foo"
which can be done using map:
doc.search('p').map { |node| node.text } # => ["foo", "bar", "baz"]
Ruby allows us to write that more concisely using:
doc.search('p').map(&:text) # => ["foo", "bar", "baz"]
The same thing applies whether we're working with HTML or XML, as HTML is a more relaxed version of XML.
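For example, reusing the XML from the question, the same one-liner should return the desired array:
doc = Nokogiri::XML(<<EOT)
<root>
<block>
<entries>foo</entries>
<entries>bar</entries>
<entries>baz</entries>
</block>
</root>
EOT
doc.search('entries').map(&:text) # => ["foo", "bar", "baz"]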
A Node has several aliased methods for getting at its embedded text. From the documentation:
#content ⇒ Object
Also known as: text, inner_text
Returns the contents for this Node.
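As a quick sanity check, all three names should return the same string for a given node; here is a minimal sketch using a throwaway document:
node = Nokogiri::HTML('<p>foo</p>').at('p')
node.content    # => "foo"
node.text       # => "foo"
node.inner_text # => "foo"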