I'm using XMLReader and PHP to process a moderately sized XML file (6 MB), break up the attribute data, and insert it into my own database. The problem is that each element has a variable number of subelements with identically named attributes.
Here's an example (this is open data about the government courtesy of govtrack.us):
<?xml version="1.0" ?>
<people>
<person id='400001' lastname='Abercrombie' firstname='Neil' birthday='1938-06-26' gender='M' pvsid='26827' osid='N00007665' bioguideid='A000014' metavidid='Neil_Abercrombie' youtubeid='hawaiirep1' name='Rep. Neil Abercrombie [D, HI-1]' title='Rep.' state='HI' district='1' >
<role type='rep' startdate='1985-01-03' enddate='1986-10-18' party='Democrat' state='HI' district='1' />
<role type='rep' startdate='1991-01-03' enddate='1992-10-09' party='Democrat' state='HI' district='1' />
<role type='rep' startdate='1993-01-05' enddate='1994-12-01' party='Democrat' state='HI' district='1' />
<role type='rep' startdate='1995-01-04' enddate='1996-10-04' party='Democrat' state='HI' district='1' />
<role type='rep' startdate='1997-01-07' enddate='1998-12-19' party='Democrat' state='HI' district='1' />
<role type='rep' startdate='1999-01-06' enddate='2000-12-15' party='Democrat' state='HI' district='1' />
<role type='rep' startdate='2001-01-03' enddate='2002-11-22' party='Democrat' state='HI' district='1' />
<role type='rep' startdate='2003-01-07' enddate='2004-12-09' party='Democrat' state='HI' district='1' url='http://www.house.gov/abercrombie' />
<role type='rep' startdate='2005-01-04' enddate='2006-12-08' party='Democrat' state='HI' district='1' url='http://www.house.gov/abercrombie' />
<role type='rep' startdate='2007-01-04' enddate='2009-01-03' party='Democrat' state='HI' district='1' url='http://www.house.gov/abercrombie' />
<role type='rep' startdate='2009-01-06' enddate='2010-03-01' party='Democrat' state='HI' district='1' url='http://www.house.gov/abercrombie' />
</person>
</people>
I don't need any fancy logic to be performed on the attributes. At the beginning of my script, I check to see if I've already processed this particular record (based on the 'id' attribute), and then I grab pretty much every attribute and parse it into my db. But there are two problems:
1) When I use this:
$p->getAttribute('id')
to get the 'id', it gives it to me twice, separated by as many line breaks as there are subelements in the element (I think the comment on this page speaks to that, but I'm not sure what to do about it).
2) How do I access the attributes of each subelement sequentially? This:
$p->getAttribute('startdate')
gives me every 'startdate' value separated by multiple line breaks. I just need to grab the id of the 'person' element and then cycle through each of its 'role' subelements.
Any ideas?
For reference, here's the super-simple controller I have so far:
$f = base_url().'data/people.xml';
$p = new XMLReader;
$p->open($f);
while($p->read())
{
if($this->_notImported('govtrack',$p->getAttribute('id')))
{
// here I just grab the attributes, put them into arrays to insert, like so:
$insert = array('indiv_name' => $full_name,
'indiv_first' => ($p->getAttribute('firstname')),
'indiv_last' => ($p->getAttribute('lastname')),
'indiv_middle' => ($p->getAttribute('middlename')),
'indiv_other' => ($p->getAttribute('namemod')),
'indiv_full_name' => $full_name,
'indiv_title' => ($p->getAttribute('title')),
'indiv_dob' => ($p->getAttribute('birthday')),
'indiv_gender' => ($p->getAttribute('gender')),
'indiv_religion' => ($p->getAttribute('religion')),
'indiv_url' => ($url)
);
}
}
For the 'person' element itself this is not too difficult, but I don't know how to cycle through each of its 'role' subelements and grab their attributes separately.
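To make the goal concrete, here is a minimal sketch of the kind of loop I'm after. It assumes the duplicate values and blank lines come from read() stopping on every node (whitespace and closing tags included) rather than only on opening tags, so it filters on nodeType. The $people array and its layout are hypothetical, the XML is a trimmed copy of the sample above, and the database insert is left out:

```php
<?php
// Trimmed copy of the sample data; the real script would call $reader->open() on people.xml.
$xml = <<<XML
<?xml version="1.0" ?>
<people>
<person id='400001' lastname='Abercrombie' firstname='Neil'>
<role type='rep' startdate='1985-01-03' enddate='1986-10-18' party='Democrat' state='HI' district='1' />
<role type='rep' startdate='1991-01-03' enddate='1992-10-09' party='Democrat' state='HI' district='1' />
</person>
</people>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$people = [];       // hypothetical layout: person id => person attributes plus a 'roles' list
$currentId = null;  // id of the 'person' element currently being read

while ($reader->read()) {
    // Only opening tags are of interest; skip whitespace, text and closing-tag nodes.
    if ($reader->nodeType !== XMLReader::ELEMENT) {
        continue;
    }

    if ($reader->name === 'person') {
        $currentId = $reader->getAttribute('id');
        $people[$currentId] = [
            'firstname' => $reader->getAttribute('firstname'),
            'lastname'  => $reader->getAttribute('lastname'),
            'roles'     => [],
        ];
    } elseif ($reader->name === 'role' && $currentId !== null) {
        // Each 'role' is attached to the most recently opened 'person'.
        $people[$currentId]['roles'][] = [
            'type'      => $reader->getAttribute('type'),
            'startdate' => $reader->getAttribute('startdate'),
            'enddate'   => $reader->getAttribute('enddate'),
            'party'     => $reader->getAttribute('party'),
        ];
    }
}
$reader->close();

// Print a short summary line per person.
foreach ($people as $id => $person) {
    printf("%s: %d roles\n", $id, count($person['roles']));
}
```

If that assumption holds, the per-person insert could run when the next 'person' element starts (or at the end of the file), once all of its roles have been collected.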