PHP XMLReader, get the version and encoding

I'm currently rewriting a PHP class that splits an XML file into smaller chunks, so that it uses XMLReader and XMLWriter instead of the current basic filesystem and regex approach.

However, I can't figure out how to get the version, encoding and standalone values from the XML declaration.

The start of my test XML file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE fakedoctype SYSTEM "fake_doc_type.dtd">

 <!--
 This is a comment, it's here to try and get the parser to break in some way
 --> 

<root attribute="value" otherattribute="othervalue">

I can open it okay with the reader and move through the document with read(), next() etc, but I just can't seem to get whatever's in <?xml ... ?>. The first thing I'm able to access is the fake DOCTYPE.

My testing code is as follows:

$a = new XMLReader ();
var_dump ($a -> open ('/path/to/test/file.xml')); // true
var_dump ($a -> nodeType); // 0
var_dump ($a -> name); // ""
var_dump ($a -> readOuterXML ()); // ''
var_dump ($a -> read ()); // true
var_dump ($a -> nodeType); // 10
var_dump ($a -> readOuterXML ()); // <!DOCTYPE fakedoctype SYSTEM "fake_doc_type.dtd">

Of course I could just always assume XML 1.0, encoding UTF-8 and standalone = yes, but for the sake of correctness I'd really rather read the actual values from my source feed and use them when generating the split files.

The documentation on XMLReader and XMLWriter seems to be very poor, so there's every chance I've just missed something in the docs. Does anyone know what to do in this case?

As far as I know, even though XMLReader has the XMLReader::XML_DECLARATION constant, I have never seen it show up in the XMLReader::$nodeType property when traversing a document with XMLReader::read().

It looks like the declaration gets skipped. I also wondered why that is, and I have not yet found any flag or option to change this behavior.
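A quick way to check this on your own setup is to collect every node type the reader reports. This is only a minimal sketch; the inline document and the variable names are made up for illustration:

```php
<?php
// Made-up test document with an XML declaration and a DOCTYPE.
$xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
     . '<!DOCTYPE root>' . "\n"
     . '<root attribute="value"/>';

$reader = new XMLReader();
$reader->XML($xml);

// Record the node type of every node that read() stops on.
$types = [];
while ($reader->read()) {
    $types[] = $reader->nodeType;
}
$reader->close();

// Is the declaration among the reported node types?
var_dump(in_array(XMLReader::XML_DECLARATION, $types, true));
```

If the declaration is skipped as described, the reported types start with the DOCTYPE node and XMLReader::XML_DECLARATION never appears.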

For the output, XMLReader always returns UTF-8 encoded strings, the same as the other libxml-based extensions in PHP, so that side is clear. But I assume that is not the part you're interested in; you want the concrete declaration in the input file you open with XMLReader::open().

Not specifically for XMLReader, I once created a utility class named XMLRecoder that detects the encoding of an XML string based on the XML declaration and also based on the BOM. I think you should do both. This is one part where I think you still need regular expressions, but since the XML declaration must be the first thing in the document and its processing-instruction-like syntax is strictly defined, you should be able to peek in there.
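To illustrate the output side, here is a small sketch that feeds the reader an ISO-8859-1 document and looks at the bytes that come back. The document and variable names are invented for the example:

```php
<?php
// ISO-8859-1 input: "\xE9" is "é" as a single Latin-1 byte.
$xml = '<?xml version="1.0" encoding="ISO-8859-1"?>' . "<root>caf\xE9</root>";

$reader = new XMLReader();
$reader->XML($xml);

// Walk the document and keep the value of the text node.
$text = null;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::TEXT) {
        $text = $reader->value;
    }
}
$reader->close();

// Show the raw bytes of the returned string; the "é" should now be
// the two-byte UTF-8 sequence C3 A9 instead of the single byte E9.
echo bin2hex($text), "\n";
```

So whatever the input encoding is, the values you read are UTF-8, and the declared encoding has to be obtained some other way.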

This is some related part from the XMLRecoder code:

### excerpt from https://gist.github.com/hakre/5194634 

/**
 * pcre pattern to access EncodingDecl, see <http://www.w3.org/TR/REC-xml/#sec-prolog-dtd>
 */
const DECL_PATTERN = '(^<\?xml\s+version\s*=\s*(["\'])(1\.\d+)\1\s+encoding\s*=\s*(["\'])(((?!\3).)*)\3)';
const DECL_ENC_GROUP = 4;
const ENC_PATTERN = '(^[A-Za-z][A-Za-z0-9._-]*$)';

...

($result = preg_match(self::DECL_PATTERN, $buffer, $matches, PREG_OFFSET_CAPTURE))
    && $result = $matches[self::DECL_ENC_GROUP];

As this shows, the pattern only goes up to the encoding, so it is not complete (it does not cover standalone). However, for extracting the encoding (and, for your needs, the version, which is capture group 2), it should do the job. I ran this against thousands of random XML documents for testing.
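If you also need the standalone flag, the same idea can be extended. The following is only a sketch and not part of XMLRecoder: the function name peekXmlDeclaration and the returned array keys are hypothetical, and it assumes an ASCII-compatible encoding such as UTF-8 or ISO-8859-1 (UTF-16 or UTF-32 input would have to be converted first):

```php
<?php
/**
 * Hypothetical helper: extract version, encoding and standalone from the
 * XML declaration at the start of $head. Assumes an ASCII-compatible
 * encoding; UTF-16/UTF-32 input is not handled.
 *
 * @param string $head the first bytes of the document
 * @return array|null keys 'version', 'encoding', 'standalone' (NULL when
 *                    absent), or NULL if there is no XML declaration
 */
function peekXmlDeclaration($head)
{
    // Skip a UTF-8 BOM if present.
    if (strncmp($head, "\xEF\xBB\xBF", 3) === 0) {
        $head = substr($head, 3);
    }

    // version is mandatory; encoding and standalone are optional, in that order.
    $pattern = '~^<\?xml'
        . '\s+version\s*=\s*(["\'])(?<version>1\.\d+)\1'
        . '(?:\s+encoding\s*=\s*(["\'])(?<encoding>[A-Za-z][A-Za-z0-9._-]*)\3)?'
        . '(?:\s+standalone\s*=\s*(["\'])(?<standalone>yes|no)\5)?'
        . '\s*\?>~';

    if (!preg_match($pattern, $head, $m)) {
        return null;
    }

    // Unmatched optional groups are either missing or empty strings.
    $get = function ($key) use ($m) {
        return (isset($m[$key]) && $m[$key] !== '') ? $m[$key] : null;
    };

    return [
        'version'    => $m['version'],
        'encoding'   => $get('encoding'),
        'standalone' => $get('standalone'),
    ];
}

// Usage: only the first few hundred bytes of the file are needed, e.g.
// $head = file_get_contents('/path/to/test/file.xml', false, null, 0, 256);
$info = peekXmlDeclaration('<?xml version="1.0" encoding="UTF-8"?>' . "\n<root/>");
var_dump($info);
```

The three values line up with the arguments of XMLWriter::startDocument($version, $encoding, $standalone), so they can be passed on when writing the split files.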

Another part is the BOM detection:

### excerpt from https://gist.github.com/hakre/5194634 

const BOM_UTF_8 = "\xEF\xBB\xBF";
const BOM_UTF_32LE = "\xFF\xFE\x00\x00";
const BOM_UTF_16LE = "\xFF\xFE";
const BOM_UTF_32BE = "\x00\x00\xFE\xFF";
const BOM_UTF_16BE = "\xFE\xFF";

...

/**
 * @param string $string string (recommended length 4 characters/octets)
 * @param string $default (optional) what to return if no BOM is detected
 * @return string|null detected encoding, or $default (NULL) if none is detected
 * @throws InvalidArgumentException
 */
public function detectEncodingViaBom($string, $default = NULL)
{
    $len = strlen($string);

    if ($len > 4) {
        $string = substr($string, 0, 4);
    } elseif ($len < 4) {
        throw new InvalidArgumentException(sprintf("Need at least four characters, %d given.", $len));
    }

    switch (true) {
        case $string === self::BOM_UTF_16BE . $string[2] . $string[3]:
            return "UTF-16BE";

        case $string === self::BOM_UTF_8 . $string[3]:
            return "UTF-8";

        case $string === self::BOM_UTF_32LE:
            return "UTF-32LE";

        case $string === self::BOM_UTF_16LE . $string[2] . $string[3]:
            return "UTF-16LE";

        case $string === self::BOM_UTF_32BE:
            return "UTF-32BE";
    }

    return $default;
}

I also ran the BOM detection against the same set of XML documents; however, not many of them had a BOM. As you can see, the detection order is optimized for the more common scenarios while taking care of the overlapping byte patterns between the different BOMs (the UTF-32LE BOM starts with the UTF-16LE BOM). Most documents I encountered have no BOM, and you mainly need it to find out whether the document is UTF-32 encoded.
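To make the overlap concrete, here is a simplified standalone variant. It is not the XMLRecoder method: the function name detectBom is hypothetical, and instead of ordering by frequency it simply tests the longer BOMs first:

```php
<?php
/**
 * Hypothetical standalone variant of the BOM check. Longer BOMs are
 * tested first, because the UTF-32LE BOM (FF FE 00 00) begins with the
 * UTF-16LE BOM (FF FE).
 *
 * @param string $head the first bytes of the document
 * @param string|null $default what to return if no BOM is found
 * @return string|null
 */
function detectBom($head, $default = null)
{
    $boms = [
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16LE' => "\xFF\xFE",
        'UTF-16BE' => "\xFE\xFF",
    ];

    foreach ($boms as $encoding => $bom) {
        // strncmp() is binary safe, so NUL bytes in the BOM are fine.
        if (strncmp($head, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }

    return $default;
}

// FF FE followed by "<" encoded as UTF-16LE: a UTF-16LE BOM, not UTF-32LE.
var_dump(detectBom("\xFF\xFE<\x00"));
```

Testing UTF-16LE before UTF-32LE would misreport every UTF-32LE document, which is why the order matters in either variant.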

Hope this at least gives some insights.