I'm processing an XML file that I receive from a partner. I have no influence over the structure of this XML file. An extract of the XML is:
<?xml version="1.0" encoding="UTF-8"?>
<objects>
  <object>
    <id>VW-XJC9</id>
    <name>Name</name>
    <type>House</type>
    <description>
      <![CDATA[<p>some descrioption of the house</p>]]> </description>
    <localcosts>
      <localcost>
        <type>mandatory</type>
        <name>What kind of cost</name>
        <description>
          <![CDATA[Some text again, different than the first tag]]>
        </description>
      </localcost>
    </localcosts>
  </object>
</objects>
The reason I use XML::Twig is that this XML file is large: about 11 GB, containing about 100,000 different objects. The problem is that when I reach the localcosts part, the three fields (type, name and description) are skipped, probably because these names have already been used earlier in the object.
The code I use to go through the xml file is as follows:
my $twig= new XML::Twig( twig_handlers => {
    id          => \&get_ID,
    name        => \&get_Name,
    type        => \&get_Type,
    description => \&get_Description,
    localcosts  => \&get_Localcosts
});
$lokaal="c:\\temp\\data3.xml";
getstore($xml, $lokaal);
$twig->parsefile("$lokaal");

sub get_ID          { my( $twig, $data)= @_; $field[0]=$data->text; $twig->purge; }
sub get_Name        { my( $twig, $data)= @_; $field[1]=$data->text; $twig->purge; }
sub get_Type        { my( $twig, $data)= @_; $field[3]=$data->text; $twig->purge; }
sub get_Description { my( $twig, $data)= @_; $field[8]=$data->text; $twig->purge; }

sub get_Localcosts{
    my ($t, $item) = @_;
    my @localcosts = $item->children;
    for my $localcost ( @localcosts ) {
        print "$field[0]: $localcost->text\n";
        my @costs = $localcost->children;
        for my $cost (@costs) {
            $Type =$cost->text if $cost->name eq q{type};
            $Name =$cost->text if $cost->name eq q{name};
            $Description=$cost->text if $cost->name eq q{description};
            print "Fields: $Type, $Name, $Description\n";
        }
    }
    $t->purge;
}
When I run this code, the main fields are read without issues, but when the code reaches the 'localcosts' part, the inner for loop is not executed. When I change the field names in the XML to unique ones, this code works perfectly.
Can someone help me out?
Thanks
If you want the handlers for type, name and description to be triggered only inside the object tag, specify the path in the handler key, for example object/type instead of type.
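A minimal sketch of path-qualified handlers follows. It assumes a small in-memory sample shaped like the extract in the question; the %object hash and the @cost_names array are hypothetical names used only for illustration, and the description field and any purging are left out for brevity.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical in-memory sample shaped like the extract in the question.
my $xml = <<'END_XML';
<objects>
  <object>
    <id>VW-XJC9</id>
    <name>Name</name>
    <type>House</type>
    <localcosts>
      <localcost>
        <type>mandatory</type>
        <name>What kind of cost</name>
      </localcost>
    </localcosts>
  </object>
</objects>
END_XML

my %object;       # top-level fields of the current object
my @cost_names;   # names seen inside localcost elements

my $twig = XML::Twig->new(
    twig_handlers => {
        # A key such as 'object/name' fires only for a <name> whose
        # parent is <object>, not for <localcost><name>.
        'object/id'      => sub { my ($t, $elt) = @_; $object{id}   = $elt->text },
        'object/name'    => sub { my ($t, $elt) = @_; $object{name} = $elt->text },
        'object/type'    => sub { my ($t, $elt) = @_; $object{type} = $elt->text },
        # The nested elements get their own, separately qualified handler.
        'localcost/name' => sub { my ($t, $elt) = @_; push @cost_names, $elt->text },
    },
);
$twig->parse($xml);

print "$object{id}: name=$object{name}, type=$object{type}\n";
print "localcost name: $_\n" for @cost_names;
```

With the handlers keyed this way, the values stored for the object are not overwritten by the identically named elements inside localcost.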
As Borodin said, if you have handlers on name, type and description, and you call $twig->purge at the end of each handler, then the elements are removed from the tree. You could set a handler on object that only does a $twig->purge call, and you would be OK.

You don't need to call purge "too often", just make sure you call it at a low enough level so you don't use too much memory. There is no point really in calling it for each single leaf element.

That's a common mistake, one that I make myself quite often ;--(.
The problem is that the id, name, type and description handlers are being executed for both occurrences. You will find that the contents of @fields is from the localcost values, as the data from the object values has been overwritten.

Also, in handling the localcost elements, the handlers have done a $twig->purge, which removes the data from memory. So when the localcosts handler is called it finds the element empty.

I think the easiest way to do this is to write a single handler that processes each object node in one go and then purges it.

This program demonstrates. Note that I have used Data::Dumper only so that you can see the contents of @fields once it has been populated.

It is very important that you use strict and use warnings at the top of every Perl program, especially if you are asking for help with it. It is a simple measure that can reveal many straightforward errors that you may otherwise waste a lot of time searching for.

Note also that the "indirect object" form of method calls is discouraged: you should write XML::Twig->new(...) instead of new XML::Twig (...).

And if you use single quotes instead of double quotes then a backslash inside a string doesn't need to be doubled-up unless it is the last character of the string. But Perl is quite happy if you use forward slashes as a path separator, even on Windows.
I hope this helps
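A minimal sketch of the single-handler approach, not necessarily identical to the answer's original program: it assumes an in-memory sample instead of the real file, and it stores one hash per object in @fields, which is an assumed layout. The handle_object name is hypothetical.

```perl
use strict;
use warnings;
use XML::Twig;
use Data::Dumper;

# Hypothetical in-memory sample shaped like the extract in the question.
my $xml = <<'END_XML';
<objects>
  <object>
    <id>VW-XJC9</id>
    <name>Name</name>
    <type>House</type>
    <description><![CDATA[<p>some description of the house</p>]]></description>
    <localcosts>
      <localcost>
        <type>mandatory</type>
        <name>What kind of cost</name>
        <description><![CDATA[Some text again]]></description>
      </localcost>
    </localcosts>
  </object>
</objects>
END_XML

my @fields;   # one record per object (assumed layout)

my $twig = XML::Twig->new(
    twig_handlers => { object => \&handle_object },
);
# For the real file, something like: $twig->parsefile('c:/temp/data3.xml');
$twig->parse($xml);

$Data::Dumper::Sortkeys = 1;
print Dumper \@fields;

# Called once per complete <object>: read everything, then purge.
sub handle_object {
    my ($t, $object) = @_;

    my %record;
    for my $tag (qw(id name type description)) {
        $record{$tag} = $object->first_child_text($tag);
    }

    my $costs = $object->first_child('localcosts');
    for my $cost ($costs ? $costs->children('localcost') : ()) {
        my %cost;
        for my $tag (qw(type name description)) {
            $cost{$tag} = $cost->first_child_text($tag);
        }
        push @{ $record{localcosts} }, \%cost;
    }

    push @fields, \%record;
    $t->purge;   # release the object that was just processed
}
```

For the full 11 GB file, each record would more likely be written out inside handle_object than accumulated in an array.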