I want to extract all the HTML5 data attributes from a tag, just like this jQuery plugin.
For example, given:
<span data-age="50" data-location="London" class="highlight">Joe Bloggs</span>
I want to get a hash like:
{ 'data-age' => '50', 'data-location' => 'London' }
I was originally hoping to use a wildcard as part of my CSS selector, e.g.
Nokogiri(html).css('span[@data-*]').size
but it seems that isn't supported.
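Option 1: Grab all data attributes
If all you need is a list of the data attributes on the page's span elements, here's a one-liner. Note that everything is merged into a single hash, so if several tags share an attribute name, the last value wins: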
Hash[doc.xpath("//span/@*[starts-with(name(), 'data-')]").map{|e| [e.name,e.value]}]
Output:
{"data-age"=>"50", "data-location"=>"London"}
Option 2: Group results by tag
If you want to group your results by tag (perhaps you need to do additional processing on each tag), you can do the following:
tags = []
datasets = "@*[starts-with(name(), 'data-')]"
# If you want any element, replace "span" with "*"
doc.xpath("//span[#{datasets}]").each do |tag|
  tags << Hash[tag.xpath(datasets).map{|a| [a.name,a.value]}]
end
Then tags is an array of hashes, one per tag, each mapping data attribute names to their values.
Option 3: Behavior like the jQuery datasets plugin
If you'd prefer the plugin-like approach, the following will give you a dataset method on every Nokogiri node.
module Nokogiri
  module XML
    class Node
      def dataset
        Hash[self.xpath("@*[starts-with(name(), 'data-')]").map{|a| [a.name,a.value]}]
      end
    end
  end
end
Then you can find the dataset for a single element:
doc.at_css("span").dataset
Or get the dataset for a group of elements:
doc.css("span").map(&:dataset)
Example:
The following shows the behavior of the dataset method above. Given the following lines in the HTML:
<span data-age="50" data-location="London" class="highlight">Joe Bloggs</span>
<span data-age="40" data-location="Oxford" class="highlight">Jim Foggs</span>
The output would be:
[
{"data-age"=>"50", "data-location"=>"London"},
{"data-age"=>"40", "data-location"=>"Oxford"}
]
You can do this with a bit of xpath:
doc = Nokogiri.HTML(html)
data_attrs = doc.xpath "//span/@*[starts-with(name(), 'data-')]"
This gets all the attributes of span elements whose names start with 'data-'. (You might want to do this in two steps: first get all the elements you're interested in, then extract the data attributes from each in turn.)
Continuing the example (using the span in your question):
hash = data_attrs.each_with_object({}) do |n, hsh|
  hsh[n.name] = n.value
end
puts hash
produces:
{"data-age"=>"50", "data-location"=>"London"}
Try looping through element.attributes while ignoring any attribute whose name does not start with data-.
The Node#css docs mention a way to attach a custom pseudo-selector. This might look like the following for selecting nodes with attributes starting with 'data-':
Nokogiri(html).css('span:regex_attrs("^data-.*")', Class.new {
  def regex_attrs node_set, regex
    node_set.find_all { |node| node.attributes.keys.any? {|k| k =~ /#{regex}/ } }
  end
}.new)