I'm working on a large PHP code base; I'd like to separate the PHP code from the HTML and JavaScript. (I need to do several automatic search-and-replaces on the PHP code, different ones on the HTML, and different ones again on the JS.) Is there a good parser engine that could separate out the PHP for me? I could do this using regular expressions, but they're not perfect. I could build something in ANTLR, perhaps, but a good already existing solution would be best.
I should make clear: I don't want or need a full PHP parser. Just need to know if a given token is:
- PHP code
- PHP single quote string
- PHP double quote string
- PHP Comment
- Not PHP, but rather HTML/JavaScript
How about the tokenizer built right into PHP itself?
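As a rough sketch of how token_get_all() could sort a mixed file into the five categories from the question: the helper name `classify_tokens` and the category labels are made up for illustration, and heredocs and backtick strings are not treated specially here.

```php
<?php
// Hypothetical helper: label every token with one of the question's categories.
function classify_tokens(string $source): array
{
    $out = [];
    foreach (token_get_all($source) as $token) {
        if (is_string($token)) {
            // One-character tokens (; ( ) { } etc.) arrive as plain strings.
            // A bare " only appears as the delimiter of an interpolated string.
            $out[] = [$token === '"' ? 'double-quoted string' : 'php', $token];
            continue;
        }
        [$id, $text] = $token; // array tokens are [id, text, line]
        if ($id === T_INLINE_HTML) {
            $category = 'html'; // everything outside the PHP tags, including JavaScript
        } elseif ($id === T_COMMENT || $id === T_DOC_COMMENT) {
            $category = 'comment';
        } elseif ($id === T_CONSTANT_ENCAPSED_STRING) {
            // A literal without interpolation; its first character is the quote used.
            $category = $text[0] === "'" ? 'single-quoted string' : 'double-quoted string';
        } elseif ($id === T_ENCAPSED_AND_WHITESPACE) {
            // Literal fragment of an interpolated string (assumption: not a heredoc).
            $category = 'double-quoted string';
        } else {
            $category = 'php';
        }
        $out[] = [$category, $text];
    }
    return $out;
}

$demo = "<p>Hi</p><?php echo 'a'; /* note */ echo \"b\"; ?>";
foreach (classify_tokens($demo) as [$category, $text]) {
    printf("%-22s %s\n", $category, json_encode($text));
}
```

Variables interpolated inside a double-quoted string come out as ordinary `php` tokens in this sketch; tracking "inside a string" state would be needed if they should be labelled differently.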
You ask in the comments whether you can regenerate the code from the tokenized output: yes, you can, because all whitespace is preserved as T_WHITESPACE tokens. Here's how you might turn the tokenized output back into code:
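A minimal sketch of that round trip, assuming the tokenizer extension is available. The function name `rebuild_source` and its optional `$rewrite` callback are illustrative, not part of PHP; the callback is one possible hook for per-token search-and-replace.

```php
<?php
// Concatenate token texts back into source, optionally rewriting each token.
function rebuild_source(array $tokens, ?callable $rewrite = null): string
{
    $source = '';
    foreach ($tokens as $token) {
        // Array tokens are [id, text, line]; one-character tokens are plain strings.
        [$id, $text] = is_array($token) ? $token : [null, $token];
        $source .= $rewrite ? $rewrite($id, $text) : $text;
    }
    return $source;
}

$original = "<h1>Title</h1>\n<?php\n  echo 'hello';   /* greet */\n?>\n<p>done</p>\n";

// Without a callback the output should be identical to the input.
$same = rebuild_source(token_get_all($original)) === $original;
var_dump($same);

// With a callback, only the chosen token type is touched (here: the HTML parts).
$shouted = rebuild_source(token_get_all($original), function ($id, $text) {
    return $id === T_INLINE_HTML ? strtoupper($text) : $text;
});
echo $shouted;
```

Because every character of the input belongs to exactly one token, untouched tokens keep their original spacing and comments.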
If all you want to do is to inspect the tokens, then the PHP tokenizer, as others have suggested, might be a good choice.
If what you want to do is to automatically change the source code in a reliable way, I'm not sure that will help you. How will you regenerate the modified source text?
Another way to do this is to use a program transformation engine. Such an engine can parse the source text to abstract syntax trees, capturing the structure of the program (as well as the effective content of all the tokens), and allow searching and transforming of those ASTs using reliable pattern matches/transformations. To do this well, you need an engine that parses PHP reliably, and can reproduce compilable source text from the changed AST.
Our DMS Software Reengineering Toolkit is such a program transformation system, and it has a robust PHP Front End that can process PHP5 accurately in terms of parsing, transforming and prettyprinting the result back to text. (Getting the PHP parser right is hard because the language is poorly documented.) Because the front end can pick up the HTML and the PHP code accurately, you don't need to separate out the text; they will be parked in clearly distinguishable places in unique tree nodes.
To change all echoed strings from lowercase to uppercase, you'd use DMS to parse the PHP, and then apply the following transformation rule:
This rule is written in DMS's Rule Specification Language (RSL), which is clearly not PHP. The stuff inside quote marks is PHP code; those are meta quotes wrapped around the text of the programming language being manipulated. The \ character is a meta-escape: \s indicates a metavariable that must match a string literal, \uppercase is the name of a DMS function external to the RSL language and the ( ) are meta parentheses around the meta-function call to uppercase, applied to the matched string \s. Because the rule operates on the ASTs, it cannot be confused; it won't change the text of /* echo 'def' */ because that isn't a statement.
You likely need several rules to handle the variety of syntax combinations: STRING in this case refers to just singly-quoted literal strings; doubly-quoted strings aren't monolithic entities but are composed of a series of QUOTED_STRING_FRAGMENTS that correspond to the text in a doubly quoted string between the PHP expressions inside that doubly-quoted string.
At the end of the transformation process, the changed AST is emitted complete with the original indentation and comments except where the transformations have been applied.
There's also a fully language-accurate JavaScript parser for DMS, which you'd need if you wanted to process the content of SCRIPT tags accurately.
If you want to make reliable changes to source code, this IMHO is the only good way to do it. You can try string hacking and regular expressions, but parsing PHP requires a context-free parser and REs don't do that, so any result you get won't be trustworthy.
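A small illustration of how a regular expression can be misled, with PHP's own tokenizer used purely as a reference point. The sample input and variable names are made up for the example, and the pattern is a deliberately naive one.

```php
<?php
// The PHP block below contains a closing-tag sequence inside a string literal.
$source = "<?php echo '?>'; ?>tail";

// Naive approach: capture everything between the open tag and the first closing tag.
preg_match('/<\?php(.*?)\?>/s', $source, $m);
$regexCode = $m[1]; // cut short: the match ends inside the string literal

// Reference: the tokenizer knows the quoted sequence is part of a string.
$html = '';
foreach (token_get_all($source) as $token) {
    if (is_array($token) && $token[0] === T_INLINE_HTML) {
        $html .= $token[1];
    }
}

var_dump($regexCode); // the truncated "code" seen by the regex
var_dump($html);      // the text the tokenizer treats as non-PHP
```

A more elaborate pattern could handle this particular case, but comments, heredocs and nested quoting each need their own special handling.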
To separate the PHP from the rest, PHP's inbuilt tokenizer is your best choice: see token_get_all().
For the rest, you might be best off with a DOM parser. Isolating the <script> parts (and external scripts, and even onXXXX events) is easy that way.
It might be tough to re-build the identical document from a parsed DOM tree, though - I guess it depends on what you need to do with the results and how clean the original HTML is. A regular expression (yuck!) could work better for that part.
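A sketch of the DOM approach, assuming PHP's DOM extension is enabled and that the HTML has already been separated from the PHP. The helper name `extract_javascript` and the shape of its return value are invented for the example.

```php
<?php
// Hypothetical helper: collect inline scripts, external script URLs and
// on* event-handler attributes from an HTML string.
function extract_javascript(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = ['inline' => [], 'external' => [], 'handlers' => []];

    foreach ($doc->getElementsByTagName('script') as $script) {
        if ($script->hasAttribute('src')) {
            $found['external'][] = $script->getAttribute('src');
        } else {
            $found['inline'][] = $script->textContent;
        }
    }

    // Any attribute whose name starts with "on" is treated as an event handler.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//@*[starts-with(name(), "on")]') as $attr) {
        $found['handlers'][] = [$attr->ownerElement->nodeName, $attr->nodeName, $attr->nodeValue];
    }

    return $found;
}

$page = '<html><body><script src="app.js"></script>'
      . '<script>var a = 1;</script>'
      . '<a href="#" onclick="go()">x</a></body></html>';
print_r(extract_javascript($page));
```

This only reads the JavaScript out; writing changes back through the DOM will normalise the markup, which is the re-building difficulty mentioned above.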