-->

How to use XMLReader to parse multiple, identicall

Posted 2019-06-07 12:19

Question:

I'm using XMLReader and PHP to process a moderately sized XML file (6 MB) and basically break up the attribute data and insert it into my own database. The problem is that each element has a variable number of subelements with identically named attributes.

Here's an example (this is open data about the government courtesy of govtrack.us):

<?xml version="1.0" ?>
<people>
    <person id='400001' lastname='Abercrombie' firstname='Neil' birthday='1938-06-26' gender='M' pvsid='26827' osid='N00007665' bioguideid='A000014' metavidid='Neil_Abercrombie' youtubeid='hawaiirep1' name='Rep. Neil Abercrombie [D, HI-1]' title='Rep.' state='HI' district='1' >
        <role type='rep' startdate='1985-01-03' enddate='1986-10-18' party='Democrat' state='HI' district='1' />
        <role type='rep' startdate='1991-01-03' enddate='1992-10-09' party='Democrat' state='HI' district='1' />
        <role type='rep' startdate='1993-01-05' enddate='1994-12-01' party='Democrat' state='HI' district='1' />
        <role type='rep' startdate='1995-01-04' enddate='1996-10-04' party='Democrat' state='HI' district='1' />
        <role type='rep' startdate='1997-01-07' enddate='1998-12-19' party='Democrat' state='HI' district='1' />
        <role type='rep' startdate='1999-01-06' enddate='2000-12-15' party='Democrat' state='HI' district='1' />
        <role type='rep' startdate='2001-01-03' enddate='2002-11-22' party='Democrat' state='HI' district='1' />
        <role type='rep' startdate='2003-01-07' enddate='2004-12-09' party='Democrat' state='HI' district='1' url='http://www.house.gov/abercrombie' />
        <role type='rep' startdate='2005-01-04' enddate='2006-12-08' party='Democrat' state='HI' district='1' url='http://www.house.gov/abercrombie' />
        <role type='rep' startdate='2007-01-04' enddate='2009-01-03' party='Democrat' state='HI' district='1' url='http://www.house.gov/abercrombie' />
        <role type='rep' startdate='2009-01-06' enddate='2010-03-01' party='Democrat' state='HI' district='1' url='http://www.house.gov/abercrombie' />
</person>

I don't need any fancy logic to be performed on the attributes. At the beginning of my script, I check to see if I've already processed this particular record (based on the 'id' attribute), and then I grab pretty much every attribute and parse it into my db. But there are two problems:

1) When I use this:

$p->getAttribute('id')

to get the 'id', it gives it to me twice, separated by as many line breaks as there are subelements in the element (I think the comment on this page speaks to that, but I'm not sure what to do about it).

2) How do I access the attributes of each subelement sequentially? This:

$p->getAttribute('startdate')

gives me every 'startdate' value separated by multiple line breaks. I just need to grab the id of the element and then cycle through each of the 'role' subelements.

Any ideas?

For edification, here's the super-simple controller I have so far:

$f = base_url().'data/people.xml';
$p = new XMLReader;
$p->open($f);
while($p->read())
{
    if($this->_notImported('govtrack',$p->getAttribute('id')))
    {
            // here I just grab the attributes, put them into arrays to insert, like so:
            $insert = array('indiv_name' => $full_name,
                                    'indiv_first' => ($p->getAttribute(‘firstname’)),
                                    'indiv_last' => ($p->getAttribute(‘lastname’)),
                                    'indiv_middle' => ($p->getAttribute(‘middlename’)),
                                    'indiv_other' => ($p->getAttribute(‘namemod’)),
                                    'indiv_full_name' => $full_name,
                                    'indiv_title' => ($p->getAttribute(‘title’)),
                                    'indiv_dob' => ($p->getAttribute(‘birthday’)),
                                    'indiv_gender' => ($p->getAttribute(‘gender’)),
                                    'indiv_religion' => ($p->getAttribute(‘religion’)),
                                    'indiv_url' => ($url)
                                    );

For the element, this is not as difficult, but I don't know how to cycle through each of the 'role' subelements and grab the attributes separately.

Answer 1:

Your first problem is that you are not checking the nodeType, which is indeed what the comment you linked describes: without that check, your code runs for both the opening tag (ELEMENT) and the closing tag (END_ELEMENT) of each element, so the 'id' shows up twice.
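
To see the double match concretely, here is a minimal sketch that records every node the reader stops on. The inline XML string and the $visits variable are made up for illustration; the real file has many more attributes and also contains whitespace between tags, which the reader reports as extra nodes.

```php
<?php
// Hypothetical, trimmed-down sample without whitespace between tags.
$xml = "<people><person id='400001'><role startdate='1985-01-03'/></person></people>";

$reader = new XMLReader();
$reader->XML($xml);

$visits = [];
while ($reader->read()) {
    // nodeType is an integer constant such as XMLReader::ELEMENT or XMLReader::END_ELEMENT.
    $visits[] = [$reader->nodeType, $reader->name];
}
$reader->close();

foreach ($visits as [$type, $name]) {
    if ($type === XMLReader::ELEMENT) {
        $label = 'open';
    } elseif ($type === XMLReader::END_ELEMENT) {
        $label = 'close';
    } else {
        $label = 'other';
    }
    echo $label, ' ', $name, PHP_EOL;
}
```

In this sketch <person> is visited once as ELEMENT and once as END_ELEMENT, while the self-closing <role/> is visited only once, because an empty element has no separate closing tag.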

Your second issue is also related to the missing nodeType check. After you fix that, you just have to check for the node's name to find out if it's a <role> or <person>.

Since you are reading a fairly large XML file, you probably also want to know when you have finished one person and are moving on to the next, which you can detect via the END_ELEMENT nodeType. See my example below:

while($p->read()) {
    // check for nodeType here (opening tag only)
    if ($p->nodeType == XMLReader::ELEMENT) {
        if ($p->name == 'person') {
            if ($this->_notImported('govtrack',$p->getAttribute('id'))) {
                // $insert['indiv_*'] stuff here
            } else {
                $insert = null; // skip record because it's already imported
            }
        } else if ($p->name == 'role') {
            // role stuff here
            $startdate = $p->getAttribute('startdate');
        }

    // check for closing </person> tag here
    } else if ($p->nodeType == XMLReader::END_ELEMENT && $p->name == 'person') {
        if (isset($insert)) {
            // db insert here
        }
    }
}

By the way, the curly quotes (‘ ’) around the attribute names in your code must be replaced with straight single quotes (') if you want this to work.
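
For a version that can be run on its own, the following is a rough sketch of the same idea, not a drop-in replacement for your controller. It assumes a shortened inline XML sample in place of people.xml, a hypothetical $alreadyImported list standing in for your _notImported() check, and plain arrays ($people, $current) in place of the database insert. The second person in the sample is invented purely to show a skipped record.

```php
<?php
// Hypothetical sample data modelled on the question's file; the second person is made up.
$xml = <<<XML
<people>
    <person id='400001' lastname='Abercrombie' firstname='Neil'>
        <role type='rep' startdate='1985-01-03' enddate='1986-10-18' />
        <role type='rep' startdate='1991-01-03' enddate='1992-10-09' />
    </person>
    <person id='400002' lastname='Example' firstname='Pat'>
        <role type='rep' startdate='2001-01-03' enddate='2002-11-22' />
    </person>
</people>
XML;

// Stand-in for the _notImported() check: ids listed here are skipped.
$alreadyImported = ['400002'];

$reader = new XMLReader();
$reader->XML($xml);

$people  = [];   // finished records, one per imported <person>
$current = null; // record being built, or null while skipping a person

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        if ($reader->name === 'person') {
            $id = $reader->getAttribute('id');
            $current = in_array($id, $alreadyImported, true)
                ? null
                : ['id' => $id, 'last' => $reader->getAttribute('lastname'), 'roles' => []];
        } elseif ($reader->name === 'role' && $current !== null) {
            // Each <role> is its own node, so its attributes are read one role at a time.
            $current['roles'][] = [
                'startdate' => $reader->getAttribute('startdate'),
                'enddate'   => $reader->getAttribute('enddate'),
            ];
        }
    } elseif ($reader->nodeType === XMLReader::END_ELEMENT && $reader->name === 'person') {
        if ($current !== null) {
            $people[] = $current; // the database insert would go here
        }
        $current = null;
    }
}
$reader->close();

print_r($people);
```

One caveat with this approach: a self-closing <person ... /> with no roles never produces an END_ELEMENT node, so if your data can contain such records you would need to handle them when the opening tag is read, for example by checking $reader->isEmptyElement.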