How to use RubyGem Sanitize transformers to saniti

Posted 2019-06-09 07:17

Is anyone familiar with the RubyGem Sanitize who can provide an example of building a "Transformer" to convert

"<ul><li>a</li><li>b</li><li>c</li></ul>" 

into

"a,b, and c"

?

1 answer
爷的心禁止访问
#2 · 2019-06-09 07:56

IMO transformers are not for pulling out data like this:

Transformers allow you to filter and modify nodes using your own custom logic [...]

This is not what you're trying to do; you're trying to pull data out of nodes, and transform it. In your example, you're not doing the same thing to each element: you're sometimes appending a comma, sometimes appending a comma and the word "and".

In order to do that, you either need to save state and post-process, or look ahead in the node stream to see if you're visiting the last node. I don't know of a trivial way to do that with Sanitize's transformers, so this example saves state and post-processes.

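require 'sanitize'
items = []
s = "<ul><li>some space</li><li>more stuff with spaces</li><li>last one</li></ul>"
# Runs for every node Sanitize visits; records the text of each text node.
save_li = lambda do |env|
  node = env[:node]
  items << node.text.strip if node.text?
end
# Note: Sanitize 3.0 and later renamed Sanitize.clean to Sanitize.fragment.
Sanitize.clean(s, :transformers => save_li)
# => "  some space  more stuff with spaces  last one  "
# Join all but the last item with commas, then append ", and <last item>".
output = "#{items[0..-2].join(", ")}, and #{items[-1]}"
# => "some space, more stuff with spaces, and last one"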

IMO this example is an abuse of transformers because it's run only for its side effect: it does nothing other than look for text nodes.
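A side note on the last line of the example: the interpolated join assumes at least two items. With a single item it produces a leading ", and", and with an empty list it produces just ", and ". Below is a minimal sketch of a more defensive join in plain Ruby. The name join_as_sentence is a hypothetical helper for illustration, not part of Sanitize or Nokogiri.

```ruby
# Hypothetical helper (not part of Sanitize): joins strings into an
# English list with a serial comma, covering the short-list cases.
def join_as_sentence(items)
  case items.length
  when 0 then ""
  when 1 then items.first.to_s
  when 2 then "#{items[0]} and #{items[1]}"
  else "#{items[0..-2].join(', ')}, and #{items[-1]}"
  end
end

puts join_as_sentence(["some space", "more stuff with spaces", "last one"])
# prints: some space, more stuff with spaces, and last one
puts join_as_sentence(["last one"])
# prints: last one
```

If the project already uses Rails, ActiveSupport's Array#to_sentence covers the same ground and may be preferable to a hand-rolled helper.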

If one of the list items has embedded HTML, the naive approach no longer works, because the item's text is split across several text nodes and each one is saved as a separate entry. You need to start knowing more Nokogiri anyway:

items = []
s = "<ul><li>some space</li><li>item <b>with</b> html</li><li>c</li></ul>"
# Collect whole <li> elements instead of individual text nodes;
# node.content returns the text of the element and all its descendants.
save_li = lambda do |env|
  node = env[:node]
  items << node.content if node.name == "li"
end
Sanitize.clean(s, :transformers => save_li)
# => "  some space  item with html  c  "
output = "#{items[0..-2].join(", ")}, and #{items[-1]}"
# => "some space, item with html, and c"

This approach relies on the default Sanitize behavior of nothing being whitelisted. The <b> tags are still visited by the save_li lambda, but they're stripped. This has the potential to cause issues under a variety of circumstances.
