I need to parse DTDs using PHP and am hoping there's a simple library to help out. Each DTD has numerous <!ENTITY ...> and <!-- Comment ... --> declarations, which I need to act upon.
Note that I do not need to validate anything against these DTDs, simply parse them as data files themselves.
A few options I've looked at:
James Clark's SP, which is an option of last resort, but I'd like to avoid the complexity of building/installing/configuring code external to PHP. I'm not sure it's even possible in my situation.
PEAR has an XML_DTD_Parser, which requires installing/configuring PEAR and a number of PEAR packages. I'm not sure that is possible either, and I would rather avoid it. Has anyone used it with success? EDIT: I've since learned that XML_DTD_Parser discards comments, so it is not a valid option for my needs.
PHP XML Classes has the class_path_parser, which another site suggested, but it fails to read ENTITY declarations. It appears to use PHP's built-in XML parsing capabilities, which use Expat.
PHP's DOMDocument will validate against a DTD, so it must be able to read them, though at first glance I don't see how to get at the DTD parser directly.
None of the standard XML parsers for PHP give access to general entities*, and few give access to comments. PHP's built-in XML Parser uses Expat, but does not expose the full Expat API; in particular, a handler for general entity declarations (Expat's XML_SetEntityDeclHandler) cannot be set. There is a PHP bug filed to add this.
AFAICT, the only way to handle comments and general entities in a DTD parser is to write your own parser, either by hand or using one of the lexer and parser generators available for PHP (e.g. PHP_LexerGenerator and PHP_ParserGenerator, among others).
* PHP's Expat wrapper (XML Parser) does give access to notation declarations, which are similar to, but not the same as, general entities.
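For what it's worth, a hand-written scanner for just these two constructs can be fairly small. The following is only a rough sketch of the "by hand" route, not a complete DTD parser: scanDtd() is a made-up name, it assumes the whole DTD fits in memory as a string, and it does not expand parameter entity references or handle conditional sections. It walks the text looking for "<!", treats "<!--" as a comment up to the next "-->", and for "<!ENTITY" looks for the closing ">" while skipping any ">" that sits inside a quoted literal.

```php
<?php
// Hypothetical helper (not part of any library): collect comments and
// <!ENTITY ...> declarations from a DTD string, in document order.
function scanDtd(string $dtd): array
{
    $items = [];
    $pos = 0;
    $len = strlen($dtd);

    while (($pos = strpos($dtd, '<!', $pos)) !== false) {
        if (substr($dtd, $pos, 4) === '<!--') {
            // A comment runs up to the next "-->".
            $end = strpos($dtd, '-->', $pos + 4);
            if ($end === false) {
                break; // unterminated comment: stop scanning
            }
            $text = substr($dtd, $pos + 4, $end - $pos - 4);
            $items[] = ['type' => 'comment', 'text' => trim($text)];
            $pos = $end + 3;
        } elseif (substr($dtd, $pos, 8) === '<!ENTITY') {
            // Find the closing ">" that is not inside a quoted literal.
            $i = $pos + 8;
            $quote = null;
            while ($i < $len) {
                $ch = $dtd[$i];
                if ($quote !== null) {
                    if ($ch === $quote) {
                        $quote = null; // end of the quoted literal
                    }
                } elseif ($ch === '"' || $ch === "'") {
                    $quote = $ch;      // start of a quoted literal
                } elseif ($ch === '>') {
                    break;             // end of the declaration
                }
                $i++;
            }
            if ($i >= $len) {
                break; // unterminated declaration: stop scanning
            }
            $text = substr($dtd, $pos + 8, $i - $pos - 8);
            $items[] = ['type' => 'entity', 'text' => trim($text)];
            $pos = $i + 1;
        } else {
            $pos += 2; // some other declaration (ELEMENT, ATTLIST, ...): skip
        }
    }

    return $items;
}
```

Each returned item is an array with a 'type' of either 'comment' or 'entity' and the raw 'text' between the delimiters, so splitting an entity into name and value is left to the caller. Because comments are consumed whole, an <!ENTITY that only appears inside a comment is not reported as a declaration.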
I don't know how useful this will be...
If I understand what you're looking for, you want a means to extract the <!ENTITY ...> and <!-- Comment ... --> "nodes" from a DTD in order to act on them. Very interesting. Here's where my brain went:
- Use the DOMDocument class directly. It looks as if there's no distinct way of getting at the DTD data if you treat the DTD as the source.
- Use SimpleXML in the same way. Ditto.
- Use the XML parser in, again, the same way, but use some of the entity declaration handler functions to get information out. I think this requires more foresight and is probably not what you need. (Although I could be wrong.)
- Use preg_match_all, or the like, to grab your values based on the patterns. Not too dissimilar to other thoughts in the world.
- Use XSLT to nix everything but what you need. The .xsl to remove all non-comments would be pretty easy to manage. It's quite possible you could just output them in a format that's easier to parse (say, in a better XML structure). Entities may require handling via PHP's XSL processor. I'm a little rusty on entities.
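To make the preg_match_all idea a bit more concrete, here is a minimal sketch. The function name extractDtdParts() and the shape of the returned arrays are my own invention for illustration, and the pattern is an assumption about what your DTDs look like: it does not expand parameter entity references or deal with conditional sections. Comments and entity declarations are matched by one alternation, so results come back in document order and an <!ENTITY that appears inside a comment is swallowed by the comment branch.

```php
<?php
// Hypothetical helper: pull comments and entity declarations out of a DTD
// string with a single regular expression.
function extractDtdParts(string $dtd): array
{
    // Branch 1: a comment, lazily matched up to the first "-->".
    // Branch 2: an entity declaration; quoted literals are matched as
    // units so that a ">" inside a value does not end the declaration.
    $pattern = '~<!--(?<comment>.*?)-->'
        . '|<!ENTITY\s+(?<param>%\s+)?(?<name>[^\s"\'>]+)\s+'
        . '(?<value>(?:"[^"]*"|\'[^\']*\'|[^"\'>])*)>~s';

    preg_match_all($pattern, $dtd, $matches, PREG_SET_ORDER);

    $parts = [];
    foreach ($matches as $m) {
        if (isset($m['name']) && $m['name'] !== '') {
            $parts[] = [
                'type'      => 'entity',
                'name'      => $m['name'],
                'parameter' => $m['param'] !== '', // true for "<!ENTITY % ..."
                'value'     => trim($m['value']),  // raw text, quotes included
            ];
        } else {
            $parts[] = ['type' => 'comment', 'text' => trim($m['comment'])];
        }
    }

    return $parts;
}
```

You would call it with the file contents, e.g. extractDtdParts(file_get_contents($path)), where $path is whatever DTD you are reading. The 'value' is left raw (for example "SYSTEM" plus a quoted system identifier for an external entity), since how to interpret it depends on what you need to do with it.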
Regardless, I hope some of this helps.