parse invalid XML manually

Posted 2020-02-13 06:21

I have an XML file that is not valid. The file itself has many problems, and I need to reimport it daily. The structure looks like this:

<products>
    <product no="AP1222-00" name="Colours kravata" price="456" currency="Kč">
        <description name="POPIS PRODUKTU">Kravata Premier Line v moderních barvách. Materiál polyester. Baleno v sáčku s černým poutkem.</description>
    </product>
    <product no="AP1222-22" name="Colours kravata" price="330" currency="Kč">
        <description name="POPIS PRODUKTU">Blabla.</description>
    </product>
</products>

Is there any easy way to get the array of products, so I can fix the problems in the file before importing it? SimpleXML etc. don't work, as the file is invalid.

Edit: Here's one complete product from the XML for reference; notice the unescaped double quotes in the product name:

<products>
    <product no="AP1222-00" name="" Colours" kravata" price="456" currency="Kč">
        <folders>
            <folder category="<b>COOL 2017</b>" subcategory="TEXTILE & FASHION"/>
            <folder category="TEXTILE & FASHION" subcategory="Kravaty a šály"/>
        </folders>
        <description name="POPIS PRODUKTU">Kravata Premier Line v moderních barvách. Materiál polyester. Baleno v sáčku s
            černým poutkem.
        </description>
        <properties>
            <property name="KS / KARTON" value="100"/>
            <property name="HMOTNOST KARTONU" value="6"/>
            <property name="NETTO HMOTNOST / KARTON" value="5"/>
            <property name="DIM1" value="15"/>
            <property name="DIM2" value="80"/>
            <property name="DIM3" value="35"/>
            <property name="TECHNOLIGIE POTISKU" value="T1 (8C, 50×80 MM)"/>
            <property name="TARIF" value="6215200090"/>
            <property name="Min. mn. (ks)" value=""/>
            <property name="M3/CARTON" value="0.042"/>
            <property name="COOL 2017 KAPITOLA" value="TEXTILE AND FASHION"/>
            <property name="COOL 2017 STRANY" value="525"/>
            <property name="main category" value="fashion"/>
        </properties>
        <images>
            <image src="http://www.andapresent.com/kepek/cms/original/83653.jpg"/>
        </images>
        <stocks>
            <stock name="navi_central" value="2"/>
            <stock name="navi_arrive" value="" date=""/>
            <stock name="eu_central" value="" date=""/>
            <stock name="eu_arrive_1" value="" date=""/>
            <stock name="eu_arive_2" value="" date=""/>
        </stocks>
    </product>
</products>

Tags: php xml parsing
1 answer
来,给爷笑一个
Reply #2 · 2020-02-13 06:39

The DOMDocument::loadHTML method is more lenient than the XML parser and can automatically fix many errors. The problem is that you have no control over how libxml fixes these errors.
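For comparison, here is a minimal sketch of that lenient route. The fragment in $broken is made up for the illustration (it is not the real file), and the variable names are hypothetical. It only shows that the HTML parser recovers where the XML parser would abort, not that the recovered tree is what you want.

```php
<?php
// Hypothetical fragment with an unescaped '&' in an attribute value.
$broken = '<products><product no="AP1222-00" name="TEXTILE & FASHION"></product></products>';

libxml_use_internal_errors(true);   // keep parser complaints out of the output
$dom = new DOMDocument;
$loaded = $dom->loadHTML($broken);  // HTML parser: recovers instead of aborting
libxml_clear_errors();

// Unknown tags are kept as elements; by default they end up inside html/body.
$found = $dom->getElementsByTagName('product')->length;
var_dump($loaded, $found);
```

Note that the resulting document is an HTML tree (with html and body wrappers), so any later code has to cope with that shape.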

That's why I suggest another approach with DOMDocument::loadXML (which uses the XML parser), but this time correcting the errors with custom rules (these aren't universal fixes; they are adapted to this specific situation).

When you switch libxml_use_internal_errors() to true, all XML errors are stored in an array of LibXMLError instances. Each of them contains an error code, the error line and the error column. (Note that the first line and the first column are 1.)
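As a small illustration of what these error objects hold, the sketch below loads a made-up three-line string (an assumption for the example, not the real file) and prints the code, line and column of every collected error.

```php
<?php
// Hypothetical document: the unescaped '&' sits on line 2.
$xml = "<products>\n<folder subcategory=\"TEXTILE & FASHION\"/>\n</products>";

libxml_use_internal_errors(true);   // collect errors instead of emitting warnings
$dom = new DOMDocument;
$dom->loadXML($xml);
$errors = libxml_get_errors();      // array of LibXMLError objects

foreach ($errors as $error) {
    // code, line, column and message are public properties of LibXMLError.
    printf("code=%d line=%d column=%d %s\n",
           $error->code, $error->line, $error->column, trim($error->message));
}
libxml_clear_errors();
```

The printed codes are the values that the repair rules are keyed on.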

$xml = file_get_contents('file.xml');

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadXML($xml);
$errors = libxml_get_errors();

if ($errors) {
    // LIBXML constant name, LIBXML error code // LIBXML error message
    define('XML_ERR_LT_IN_ATTRIBUTE', 38); // Unescaped '<' not allowed in attributes values
    define('XML_ERR_ATTRIBUTE_WITHOUT_VALUE', 41); // Specification mandate value for attribute
    define('XML_ERR_NAME_REQUIRED', 68); // xmlParseEntityRef: no name

    $rules = [
        XML_ERR_LT_IN_ATTRIBUTE => [
            'pattern' => '~(?:(?!\A)|.{%d}")[^<"]*\K<~A',
            'replacement' => [ 'string' => '&lt;', 'size' => 3 ]
        ],
        XML_ERR_ATTRIBUTE_WITHOUT_VALUE => [
            'pattern' => '~^.{%d}\h+\w+\h*=\h*"[^"]*\K"([^"]*)"~',
            'replacement' => [ 'string' => '&quot;$1&quot;', 'size' => 10 ]
        ],
        XML_ERR_NAME_REQUIRED => [
            'pattern' => '~^.{%d}[^&]*\K&~',
            'replacement' => [ 'string' => '&amp;', 'size' => 4 ]
        ]
    ];

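    $previousLineNo = 0;
    $lines = explode("\n", $xml);

    foreach ($errors as $error) {

        // Skip errors that have no repair rule.
        if (!isset($rules[$error->code])) continue;

        $currentLineNo = $error->line;

        // Reset the accumulated offset when the error is on a new line.
        if ( $currentLineNo != $previousLineNo )
            $offset = -1;

        // Work on the line by reference so the fix lands in $lines.
        $currentLine = &$lines[$currentLineNo - 1];
        $pattern = sprintf($rules[$error->code]['pattern'], $error->column + $offset);
        $currentLine = preg_replace($pattern,
                                    $rules[$error->code]['replacement']['string'],
                                    $currentLine, -1, $count);
        // Shift later columns on this line by the growth of each replacement.
        $offset += $rules[$error->code]['replacement']['size'] * $count;
        $previousLineNo = $currentLineNo;
    }
    unset($currentLine); // drop the reference into $lines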

    $xml = implode("\n", $lines);

    libxml_clear_errors();
    $dom->loadXML($xml);
    $errors = libxml_get_errors();
}

var_dump($errors);

$s = simplexml_import_dom($dom);

echo $s->product[0]["name"];

The size in the rules array is the difference between the length of the replacement string and the length of the replaced string. This way, when there are several errors on the same line, $offset shifts the position of the next error by the amount the line has already grown.
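A tiny sketch of that bookkeeping, using a made-up line and hypothetical variable names: after the first '&' is escaped, the second one sits exactly size bytes further to the right.

```php
<?php
// Hypothetical line with two unescaped ampersands.
$line = 'a & b & c';
$second = strrpos($line, '&');            // byte position of the second '&' before any fix

// Escape only the first '&'.
$fixed = preg_replace('~&~', '&amp;', $line, 1);

// Same value as the 'size' entry of the '&' rule.
$size = strlen('&amp;') - strlen('&');

// The second '&' has moved right by exactly $size bytes.
var_dump($fixed[$second + $size] === '&');
```

The same reasoning gives 3 for '<' becoming '&lt;' and 10 for a pair of quotes becoming two '&quot;' entities.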

The libxml error constants are not available in PHP, which is why they are defined manually (only to make the code more readable). You can find them here.
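Once the document loads without errors, collecting the products into a plain array could look like the sketch below. It is only an illustration: the XML string is a hypothetical, already well-formed stand-in for the repaired file, and the chosen array keys are assumptions.

```php
<?php
// Hypothetical, already well-formed sample standing in for the repaired file.
$xml = '<products>'
     . '<product no="AP1222-00" name="Colours kravata" price="456"/>'
     . '<product no="AP1222-22" name="Colours kravata" price="330"/>'
     . '</products>';

$dom = new DOMDocument;
$dom->loadXML($xml);
$s = simplexml_import_dom($dom);

$products = [];
foreach ($s->product as $product) {
    // Cast each attribute so the array holds plain scalars, not SimpleXML objects.
    $products[] = [
        'no'    => (string) $product['no'],
        'name'  => (string) $product['name'],
        'price' => (int) $product['price'],
    ];
}

print_r($products);
```

Casting matters here because SimpleXML attributes are objects tied to the document; plain strings and integers are easier to hand to an import routine.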
