I do lots of work manipulating and analyzing PHP code. Normally I just use the Tokenizer to do this. For most applications this is sufficient. But sometimes parsing using a lexer just isn't reliable enough (obviously).
Thus I am looking for a PHP parser written in PHP. I found hnw/PhpParser and kumatch/stagehand-php-parser. Both were created by an automated conversion of zend_language_parser.y to a .y file with PHP instead of C (which is then compiled to a LALR(1) parser). But the output of this automated conversion is practically impossible to work with.
So, is there any decent PHP parser written in PHP? (I need one for PHP 5.2 and one for 5.3. But just one of them would be a good starting point, too.)
Well, this isn't in PHP, sorry, but building this kind of machinery is hard, and PHP isn't particularly suited for the task of language processing.
Our PHP Front End provides full PHP 4.x and 5.x (EDIT 9/2016: now handles PHP 7) parsing, automatically builds ASTs with all the details of a full PHP grammar, and can generate compilable source text from the ASTs. This is harder than it might sound when you consider all the screwy details, including weird string literals, captured comments, numbers-with-radix, etc.
But ASTs are hardly enough (you've already observed that tokens alone aren't enough).
The foundation on which it is built, the DMS Software Reengineering Toolkit, provides support for analysis and arbitrary transformations of the ASTs. It will also read large sets of files at once, enabling analysis and transformations across PHP files.
There is a port of ANTLR to PHP: http://code.google.com/p/antlrphpruntime/w/list
It's abandoned, but I think it should still work.
This isn't going to be a great option for you, as it violates the pure-PHP constraint, but:
A while ago, the php-internals folks decided that they would switch to Lemon as their parsing technology. There's a branch in the PHP svn repo that contains the required changes.
They decided not to continue with this, as they found that their Lemon solution was about 10-15% slower. But the branch is still there.
There's an older Lemon parser written as a PHP extension. You might be able to work with it. There's also this PEAR package. There's also this other lemon package (via this blog post about PGN).
Of course, even if you get it working, I'm not sure what you'd do with the data, or what the data even looks like.
Another wacky option would be peeking at Quercus, a PHP implementation in Java. They must have written a parser, so it might be worth investigating.
After no complete and stable parser was found here I decided to write one myself. Here is the result:
The project supports parsing code written for any PHP version between PHP 5.2 and PHP 7.1.
Apart from the parser itself the library provides some related components:
For a usage overview see the "Usage of basic components" section of the documentation.
The metrics tool PHP Depend contains code, written entirely in PHP, that generates an AST from PHP source. It does, however, rely on PHP's own token_get_all for the tokenization.
The source code is available on github: https://github.com/manuelpichler/pdepend/tree/master/src/main/php/PHP/Depend
The implementation of the AST for some parts like mathematical expressions was not yet complete last I checked, but according to its author that is the goal.
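To illustrate the tokenization step mentioned above, here is a minimal sketch of what token_get_all returns and how a parser written in PHP might normalize it before building an AST. The helper name normalize_tokens and the 'CHAR' label are made up for this example and are not taken from PHP Depend; only token_get_all, token_name and the T_* constants are PHP's own.

```php
<?php
// Hypothetical helper (not part of PHP Depend): turn the mixed output of
// token_get_all() into uniform array(name, text) pairs and drop whitespace.
function normalize_tokens($source)
{
    $result = array();
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            // Most tokens come back as array(id, text, line).
            if ($token[0] === T_WHITESPACE) {
                continue;
            }
            $result[] = array(token_name($token[0]), $token[1]);
        } else {
            // Single-character tokens such as "=" or ";" are plain strings.
            // 'CHAR' is an arbitrary label chosen for this sketch.
            $result[] = array('CHAR', $token);
        }
    }
    return $result;
}

// Print one token per line for a tiny input.
foreach (normalize_tokens('<?php $a = 1 + 2;') as $pair) {
    echo $pair[0], ' ', var_export($pair[1], true), "\n";
}
```

The result is still a flat list: nothing in it says that `1 + 2` is a subexpression of the assignment. Recovering that structure is the job of the parser that sits on top of the token stream.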