I am trying to parse, not evaluate, Rails ERB files in a Hpricot/Nokogiri-style manner. The files contain HTML fragments intermixed with dynamic content generated using ERB (standard Rails view files). I am looking for a library that will not only parse the surrounding content, much the way Hpricot or Nokogiri does, but will also treat the ERB symbols (<%, <%=, etc.) as though they were HTML/XML tags.
Ideally I would get back a DOM-like structure in which the <%, <%=, etc. symbols are included as their own node types.
I know it is possible to hack something together using regular expressions, but I am looking for something more reliable: I am developing a tool that needs to run on a very large view code base where both the HTML content and the ERB content are important.
For example, content such as:
blah blah blah <div>My Great Text <%= my_dynamic_expression %></div>
Would return a tree structure like:
root
  - text_node (blah blah blah)
  - element (div)
      - text_node (My Great Text )
      - erb_node (<%=)
I recently had a similar problem. The approach I took was to write a small script (erblint.rb) that does a string substitution to convert the ERB tags (<% %> and <%= %>) to XML tags, and then parses the result using Nokogiri. I've posted the script as a gist on GitHub: https://gist.github.com/787145
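The substitution step described above can be sketched roughly as follows. This is not the code from the gist: the element names erb-output and erb-code, the ERB_TAG pattern, and the helpers erb_to_xml and parse_erb are all assumptions made for illustration. The sketch also ignores variants such as <%- -%> and ERB tags placed inside attribute values, which would need extra handling.

```ruby
require "cgi"

# Assumed pattern: matches <% ... %> and <%= ... %>, capturing the optional "="
# and the Ruby code between the delimiters.
ERB_TAG = /<%(=?)(.*?)%>/m

# Replace each ERB tag with an XML element (hypothetical names), escaping the
# Ruby code so characters such as < and & cannot break the markup.
def erb_to_xml(source)
  source.gsub(ERB_TAG) do
    marker, code = $1, $2
    name = marker == "=" ? "erb-output" : "erb-code"
    "<#{name}>#{CGI.escapeHTML(code.strip)}</#{name}>"
  end
end

# Parse the substituted markup as an HTML fragment with Nokogiri.
def parse_erb(source)
  require "nokogiri" # loaded lazily; only this step needs the gem
  Nokogiri::HTML::DocumentFragment.parse(erb_to_xml(source))
end
```

With this in place, something like parse_erb(File.read(path)).css("erb-output") should select the output tags as ordinary nodes, assuming the nokogiri gem is installed.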
I eventually ended up solving this problem by using RLex (http://raa.ruby-lang.org/project/ruby-lex/), the Ruby version of lex, with the following grammar:
This is not a complete grammar, but for my purposes (locating and re-emitting text) it worked. I combined that grammar with this small piece of code:
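As a rough stand-in for the lexer idea, here is a minimal sketch that uses StringScanner from Ruby's standard library rather than RLex. It is not the grammar or code referred to above. The function name lex_erb and the token types :text, :erb and :erb_output are hypothetical, and only the plain <% %> and <%= %> forms are recognised.

```ruby
require "strscan"

# Split an ERB template into [type, text] tokens. Each token keeps its exact
# source text, so joining the token texts reproduces the input unchanged.
def lex_erb(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    if scanner.scan(/<%(=?)(.*?)%>/m)
      # A complete ERB tag starts at the current position.
      type = scanner[1] == "=" ? :erb_output : :erb
      tokens << [type, scanner.matched]
    elsif (text = scanner.scan(/(?:(?!<%).)+/m))
      # Literal template text up to the next "<%" or the end of input.
      tokens << [:text, text]
    else
      # An unterminated "<%": treat the remainder as literal text.
      tokens << [:text, scanner.rest]
      scanner.terminate
    end
  end
  tokens
end
```

Because every token carries its original text, tokens.map(&:last).join gives back the source, which suits a locate-and-re-emit workflow: individual tokens can be inspected or rewritten before the template is written out again.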