I need to parse regular expressions into their components in PHP. I have no problem creating the regular expressions or executing them, but I want to display information about the regular expression (e.g. list the capture groups, attach repetition characters to their targets, ...). The overall project is a plugin for WordPress that gives info about the rewrite rules, which are regexes with substitution patterns, and can be cryptic to understand.
I have written a simple implementation myself, which seems to handle the simple regexes I throw at it and converts them to syntax trees. Before I expand this example to support more of the regex syntax, I would like to know whether there are other good implementations I can look at. The implementation language does not really matter. I assume most parsers are written to optimize matching speed, but that is not important for me, and may even hinder clarity.
What you need is a grammar and a way to generate a parser for it. The easiest approach to producing a parser is to code a recursive descent parser directly in your target language (e.g., in PHP), so that the parser is shaped exactly like your grammar (which makes it maintainable, too).
Lots of details on how to do this, once you have a grammar, are provided in my SO description of how to build recursive descent parsers and additional theory details here.
As for regex grammars, a simple grammar (maybe not the one you had in mind) is:
A recursive descent parser written in PHP to process this grammar should be on the order of a few hundred lines at most.
Given this as a starting place, you should be able to add the features of PHP Regexes to it.
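To make the "parser shaped like the grammar" idea concrete, here is a minimal sketch in Python (porting it to PHP should be mostly mechanical). It assumes a small illustrative grammar of my own choosing, not necessarily the one given above: `alt := concat ('|' concat)*`, `concat := repeat*`, `repeat := atom ('*' | '+' | '?')*`, `atom := '(' alt ')' | '\' char | '.' | literal`. The class name, method names, and tuple-based node shapes are all hypothetical, and character classes, `{m,n}` quantifiers, and non-capturing groups are deliberately left out.

```python
class RegexParser:
    """Recursive descent parser with one method per rule of the assumed grammar."""

    def __init__(self, pattern):
        self.pattern = pattern
        self.pos = 0
        self.group_count = 0  # number of capture groups seen so far

    def peek(self):
        # Next character without consuming it, or None at end of input.
        return self.pattern[self.pos] if self.pos < len(self.pattern) else None

    def take(self):
        ch = self.pattern[self.pos]
        self.pos += 1
        return ch

    def parse(self):
        tree = self.parse_alt()
        if self.pos != len(self.pattern):
            raise ValueError(f"unexpected {self.peek()!r} at offset {self.pos}")
        return tree

    def parse_alt(self):
        # alt := concat ('|' concat)*
        branches = [self.parse_concat()]
        while self.peek() == "|":
            self.take()
            branches.append(self.parse_concat())
        return branches[0] if len(branches) == 1 else ("alt", branches)

    def parse_concat(self):
        # concat := repeat*   (stops at '|', ')' or end of input)
        items = []
        while self.peek() is not None and self.peek() not in "|)":
            items.append(self.parse_repeat())
        return items[0] if len(items) == 1 else ("concat", items)

    def parse_repeat(self):
        # repeat := atom ('*' | '+' | '?')*  -- attaches each operator to its target
        node = self.parse_atom()
        while self.peek() is not None and self.peek() in "*+?":
            node = ("repeat", self.take(), node)
        return node

    def parse_atom(self):
        # atom := '(' alt ')' | '\' char | '.' | literal
        ch = self.take()
        if ch in "*+?":
            raise ValueError(f"nothing to repeat at offset {self.pos - 1}")
        if ch == "(":
            self.group_count += 1
            number = self.group_count  # groups are numbered by opening parenthesis
            inner = self.parse_alt()
            if self.peek() != ")":
                raise ValueError("missing ')'")
            self.take()
            return ("group", number, inner)
        if ch == "\\":
            if self.peek() is None:
                raise ValueError("dangling backslash")
            return ("escape", self.take())
        if ch == ".":
            return ("any",)
        return ("char", ch)


tree = RegexParser("(a|b)+c").parse()
print(tree)
```

Because every repetition node wraps its target and every group node carries its number, listing capture groups or describing "what does this `+` apply to" becomes a plain tree walk.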
Happy parsing!
The Perl module YAPE::Regex::Explain can probably be ported to PHP fairly easily. Here is an example of its output:
You can look at the source code and quickly see the implementation.
I would try to translate an ActionScript 1/2 regular expression library to PHP. Earlier versions of Flash didn't have native regex support, so there are a few libraries written in AS out there. Translating from one dynamic language into another should be much easier than trying to decipher C.
Here's one link perhaps worth looking at: http://www.jurjans.lv/flash/RegExp.html
Well, you can take a look at the implementation of the regex functions in PHP. As PHP is an open source project, all the sources and documentation are available to the public.
You may be interested in a project I did last summer. It is a Javascript program which provides dynamic syntax highlighting of PCRE compatible regular expressions:
See: Dynamic (?:Regex Highlighting)++ with Javascript!
and the associated tester page
and the GitHub project page
The code uses (Javascript) regex to pick apart a (PCRE) regex into its various parts and applies markup to allow the user to mouse over various components and see the matching brackets and capture group numbers.
(I wrote it using regex because I didn't know any better! 8^)
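For comparison, the "use a regex to pick apart a regex" approach can be sketched roughly as follows. This is not the code from the project above; it is a small Python illustration with hypothetical names (`TOKEN`, `tokenize`) that recognises only a handful of constructs: backslash escapes, character classes, `(?:`, `(?=`, `(?!`, plain capturing groups, simple quantifiers, and alternation. Anything else in PCRE (lookbehind, atomic groups, inline flags, and so on) would be misclassified by this sketch.

```python
import re

# One named alternative per token kind; order matters (escapes first, literals last).
TOKEN = re.compile(r"""
    (?P<escape>\\.)
  | (?P<charclass>\[\^?\]?(?:\\.|[^\]\\])*\])
  | (?P<noncapture>\(\?[:=!])
  | (?P<open>\()
  | (?P<close>\))
  | (?P<quantifier>[*+?]|\{\d+(?:,\d*)?\})
  | (?P<alternation>\|)
  | (?P<literal>.)
""", re.VERBOSE | re.DOTALL)


def tokenize(pattern):
    """Split a regex into tokens and attach capture group numbers to parentheses."""
    tokens = []
    group_number = 0
    open_groups = []  # stack of group numbers; None marks a non-capturing group
    for match in TOKEN.finditer(pattern):
        kind = match.lastgroup
        entry = {"kind": kind, "text": match.group(), "start": match.start()}
        if kind == "open":
            group_number += 1
            entry["group"] = group_number
            open_groups.append(group_number)
        elif kind == "noncapture":
            open_groups.append(None)
        elif kind == "close" and open_groups:
            # Pair the closing parenthesis with the most recently opened group.
            entry["group"] = open_groups.pop()
        tokens.append(entry)
    return tokens


# A pattern in the style of a WordPress rewrite rule.
for token in tokenize(r"([^/]+)/(?:page/)?(\d+)/?$"):
    print(token)
```

A flat token list like this is enough for highlighting and bracket matching, but it does not record which atom a quantifier applies to; for that, a tree-building parser is the better fit.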
I'm the creator of Debuggex, whose requirements are very similar to yours: optimize for the amount of information that can be shown.
Below is a heavily modified (for readability) snippet from the parser that Debuggex uses. It doesn't work as-is, but is meant to demonstrate the organisation of the code. Most of the error handling was removed. So were many pieces of logic that were straightforward but verbose.
Note that recursive descent is used. This is what you've done in your parser, except yours is flattened into a single function. I used approximately this grammar for mine:
You'll notice a lot of my code is just dealing with the peculiarities of the JavaScript flavor of regexes. You can find more information about them at this reference. For PHP, this has all the information you need. I think you are well on your way with your parser; all that remains is implementing the rest of the operators and getting the edge cases right.
:) Enjoy: