Extracting multiple-line content under header tags

Posted 2019-08-02 04:45

I posted a similar question earlier, but it did not account for bodies that span multiple lines. I have HTML like the following, and I want to extract the "body" under each header using Nokogiri:

html = %q|
    <div class="content">
      <h1>Title 1</h1>
        Lorem ipsum 1

      <h2>Title 2</h2>
        Lorem ipsum 2

      <h3>Title 3</h3>
        <p>paragraph content 1</p>
        <b>Lorem ipsum 3</b>
        <p>paragraph content 2</p>

      <h1>Title 4</h1>
        Lorem ipsum 4

      <h2>Title 5</h2>
        Lorem ipsum 5
   </div>
   |

I want to extract the body content under each header and collect the results into an array, like so:

[
  "Lorem ipsum 1",
  "Lorem ipsum 2",
  "<p>paragraph content 1</p><b>Lorem ipsum 3</b><p>paragraph content 2</p>",
  "Lorem ipsum 4",
  "Lorem ipsum 5"
]

However, when I do this:

Nokogiri::HTML(html).
  css("div").
  children.
  reject{|e| e.name =~ /\Ah\d\z/}.
  map{|e| e.to_html.strip}.reject(&:empty?)

I get this array instead:

[
  "Lorem ipsum 1",
  "Lorem ipsum 2",
  "<p>paragraph content 1</p>",
  "<b>Lorem ipsum 3</b>",
  "<p>paragraph content 2</p>",
  "Lorem ipsum 4",
  "Lorem ipsum 5"
]

Is there a way to extract the multi-line "body" content so that I get my desired array?

1 answer
疯言疯语
#2 · 2019-08-02 05:20
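require "nokogiri"

Nokogiri::HTML(html)
  .css("div").children
  .slice_before { |e| e.name =~ /\Ah\d\z/ }  # start a new group at each header
  .map { |a| a.drop(1).map { |e| e.to_html.strip }.join }  # drop each group's first node (the header), join the rest
  .reject(&:empty?)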
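To see why grouping fixes the problem, here is a minimal sketch of the same idea without Nokogiri. The `node` struct and the sample `nodes` list are hypothetical stand-ins for the parsed children of the div, not part of Nokogiri's API. `slice_before` opens a new group whenever the block is truthy, so each header and everything after it (up to the next header) ends up in one group. This variant rejects the header nodes by name instead of using `drop(1)`, which is an assumption on my part that it would also behave sensibly if real content appeared before the first header.

```ruby
# Hypothetical stand-in for a parsed node: a tag name plus its serialized HTML.
node = Struct.new(:name, :html)
header = /\Ah\d\z/

nodes = [
  node.new("text", "\n"),
  node.new("h1", "<h1>Title 1</h1>"),
  node.new("text", "\n  Lorem ipsum 1\n"),
  node.new("h3", "<h3>Title 3</h3>"),
  node.new("p", "<p>paragraph content 1</p>"),
  node.new("text", "\n"),
  node.new("b", "<b>Lorem ipsum 3</b>")
]

# Start a new group in front of every header node.
groups = nodes.slice_before { |n| n.name =~ header }

# Within each group, discard the header, strip each piece, and join them
# into one string; groups that contained only whitespace are dropped.
bodies = groups.map { |g|
  g.reject { |n| n.name =~ header }.map { |n| n.html.strip }.join
}.reject(&:empty?)

p bodies
# => ["Lorem ipsum 1", "<p>paragraph content 1</p><b>Lorem ipsum 3</b>"]
```

The original attempt mapped over the children one at a time, so every `<p>` and `<b>` became its own array entry; grouping first is what lets the pieces under one header be joined.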