Extracting HTML5 data attributes from a tag

Posted 2019-06-23 15:07

Question:

I want to extract all the HTML5 data attributes from a tag, just like this jQuery plugin.

For example, given:

<span data-age="50" data-location="London" class="highlight">Joe Bloggs</span>

I want to get a hash like:

{ 'data-age' => '50', 'data-location' => 'London' }

I was originally hoping to use a wildcard as part of my CSS selector, e.g.

Nokogiri(html).css('span[@data-*]').size

but it seems that isn't supported.

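Answer 1:

Option 1: Grab all data attributes

If all you need is a single hash of the data attributes found on the page's span elements, here's a one-liner (doc is the parsed document, e.g. doc = Nokogiri::HTML(html)). Note that all pairs are merged into one hash, so if several spans share an attribute name, the last value wins: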

Hash[doc.xpath("//span/@*[starts-with(name(), 'data-')]").map{|e| [e.name,e.value]}]

Output:

{"data-age"=>"50", "data-location"=>"London"}

Option 2: Group results by tag

If you want to group your results by tag (perhaps you need to do additional processing on each tag), you can do the following:

datasets = "@*[starts-with(name(), 'data-')]"

# To match any element, replace "span" with "*".
tags = doc.xpath("//span[#{datasets}]").map do |tag|
  Hash[tag.xpath(datasets).map { |a| [a.name, a.value] }]
end

tags is then an array of hashes, one per matching tag, each mapping that tag's data attribute names to their values.

Option 3: Behavior like the jQuery datasets plugin

If you'd prefer the plugin-like approach, the following will give you a dataset method on every Nokogiri node.

module Nokogiri
  module XML
    class Node
      def dataset
        Hash[self.xpath("@*[starts-with(name(), 'data-')]").map{|a| [a.name,a.value]}]
      end
    end
  end
end

Then you can find the dataset for a single element:

doc.at_css("span").dataset

Or get the dataset for a group of elements:

doc.css("span").map(&:dataset)

Example:

Here is how the dataset method above behaves with doc.css("span").map(&:dataset). Given the following lines in the HTML:

<span data-age="50" data-location="London" class="highlight">Joe Bloggs</span>
<span data-age="40" data-location="Oxford" class="highlight">Jim Foggs</span>

The output would be:

[
 {"data-age"=>"50", "data-location"=>"London"},
 {"data-age"=>"40", "data-location"=>"Oxford"}
]


Answer 2:

You can do this with a bit of XPath:

doc = Nokogiri.HTML(html)
data_attrs = doc.xpath "//span/@*[starts-with(name(), 'data-')]"

This gets all the attributes of span elements whose names start with 'data-'. (You might want to do this in two steps: first get all the elements you're interested in, then extract the data attributes from each in turn.)

Continuing the example (using the span in your question):

hash = data_attrs.each_with_object({}) do |n, hsh|
  hsh[n.name] = n.value
end

puts hash

produces:

{"data-age"=>"50", "data-location"=>"London"}


Answer 3:

Try looping through element.attributes while ignoring any attribute whose name does not start with data-.



Answer 4:

The Node#css docs mention a way to attach a custom pseudo-class handler. This might look like the following for selecting nodes that have attributes starting with 'data-':

Nokogiri(html).css('span:regex_attrs("^data-.*")', Class.new {
  def regex_attrs node_set, regex
    node_set.find_all { |node| node.attributes.keys.any? {|k| k =~ /#{regex}/ } }
  end
}.new)