I'm processing an XML file that I receive from a partner. I have no influence over the structure of this XML file. An extract of the XML is:
<?xml version="1.0" encoding="UTF-8"?>
<objects>
<object>
<id>VW-XJC9</id>
<name>Name</name>
<type>House</type>
<description>
<![CDATA[<p>some descrioption of the house</p>]]> </description>
<localcosts>
<localcost>
<type>mandatory</type>
<name>What kind of cost</name>
<description>
<![CDATA[Some text again, different than the first tag]]>
</description>
</localcost>
</localcosts>
</object>
</objects>
The reason I use XML::Twig is that this XML file is about 11GB in size (about 100000 different objects). The problem is that when I reach the localcosts part, the 3 fields (type, name and description) are skipped, probably because these names have already been used earlier in the object.
The code I use to go through the xml file is as follows:
my $twig= new XML::Twig( twig_handlers => {
id => \&get_ID,
name => \&get_Name,
type => \&get_Type,
description => \&get_Description,
localcosts => \&get_Localcosts
});
$lokaal="c:\\temp\\data3.xml";
getstore($xml, $lokaal);
$twig->parsefile("$lokaal");
sub get_ID { my( $twig, $data)= @_; $field[0]=$data->text; $twig->purge; }
sub get_Name { my( $twig, $data)= @_; $field[1]=$data->text; $twig->purge; }
sub get_Type { my( $twig, $data)= @_; $field[3]=$data->text; $twig->purge; }
sub get_Description { my( $twig, $data)= @_; $field[8]=$data->text; $twig->purge; }
sub get_Localcosts{
my ($t, $item) = @_;
my @localcosts = $item->children;
for my $localcost ( @localcosts ) {
print "$field[0]: $localcost->text\n";
my @costs = $localcost->children;
for my $cost (@costs) {
$Type =$cost->text if $cost->name eq q{type};
$Name =$cost->text if $cost->name eq q{name};
$Description=$cost->text if $cost->name eq q{description};
print "Fields: $Type, $Name, $Description\n";
}
}
$t->purge;
}
When I run this code, the main fields are read without issues, but when the code arrives at the 'localcosts' part, the inner for loop is not executed. When I change the field names in the XML to unique ones, this code works perfectly.
Can someone help me out?
Thanks
If you want the handlers for type, name and description to be triggered only inside the object tag, specify the path:
my $twig = new XML::Twig( twig_handlers => {
id => \&get_ID,
'object/name' => \&get_Name,
'object/type' => \&get_Type,
'object/description' => \&get_Description,
localcosts => \&get_Localcosts
});
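To illustrate the effect of the path-qualified handlers, here is a minimal sketch that parses a shortened inline copy of the sample XML. The %object hash, the @costs array and the anonymous subs are stand-ins I have assumed in place of the get_* subs from the question. The leaf handlers still purge, as in the question, but because they no longer fire for the children of localcost, those children are still in the tree when the localcosts handler runs.

```perl
use strict;
use warnings;
use XML::Twig;

# Shortened inline copy of the sample from the question.
my $xml = <<'XML';
<objects>
  <object>
    <id>VW-XJC9</id>
    <name>Name</name>
    <type>House</type>
    <localcosts>
      <localcost>
        <type>mandatory</type>
        <name>What kind of cost</name>
      </localcost>
    </localcosts>
  </object>
</objects>
XML

my (%object, @costs);    # illustrative storage, not from the question

my $twig = XML::Twig->new(
    twig_handlers => {
        # These fire only for direct children of <object>.
        'object/name' => sub { $object{name} = $_[1]->text; $_[0]->purge },
        'object/type' => sub { $object{type} = $_[1]->text; $_[0]->purge },
        # The <localcost> children were never handled, so never purged.
        localcosts => sub {
            my ($t, $item) = @_;
            for my $cost ($item->children('localcost')) {
                push @costs, join ', ',
                    $cost->first_child_text('type'),
                    $cost->first_child_text('name');
            }
            $t->purge;
        },
    },
);
$twig->parse($xml);

print "$object{name} / $object{type}\n";
print "$_\n" for @costs;
```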
The problem is that the name, type and description handlers are being executed for both occurrences: once for the children of object and once for the children of localcost (id occurs only once, so its handler is not affected). You will find that the contents of @field come from the localcost values, as the data from the object values has been overwritten.
Also, in handling the children of the localcost elements, the handlers have called $twig->purge, which removes the data from memory. So when the localcosts handler is called, it finds the element empty.
I think the easiest way to do this is to write a single handler that processes each object node in one go and then purges it.
This program demonstrates. Note that I have used Data::Dumper only so that you can see the contents of @fields once it has been populated.
It is very important that you use strict and use warnings at the top of every Perl program, especially if you are asking for help with it. It is a simple measure that can reveal many straightforward errors that you may otherwise waste a lot of time searching for.
Note also that the "indirect object" form of method calls is discouraged: you should write XML::Twig->new(...) instead of new XML::Twig (...).
And if you use single quotes instead of double quotes then a backslash inside a string doesn't need to be doubled up unless it is the last character of the string. But Perl is quite happy if you use forward slashes as a path separator, even on Windows.
I hope this helps
use strict;
use warnings;
use XML::Twig;
use Data::Dumper;
$Data::Dumper::Useqq = 1;

# A single handler processes each complete <object> element
my $twig = XML::Twig->new( twig_handlers => { object => \&get_Object } );
my $lokaal = 'c:\temp\data3.xml';
my @fields;
$twig->parsefile($lokaal);

sub get_Object {
    my ($twig, $object) = @_;

    # Relative paths select only the direct children of <object>
    $fields[0] = $object->findvalue('id');
    $fields[1] = $object->findvalue('name');
    $fields[3] = $object->findvalue('type');
    $fields[8] = $object->findvalue('description');
    print Dumper \@fields;

    my @localcosts = $object->findnodes('localcosts/localcost');
    for my $localcost (@localcosts) {
        my $type        = $localcost->findvalue('type');
        my $name        = $localcost->findvalue('name');
        my $description = $localcost->findvalue('description');
        print "$type, $name, $description\n";
    }

    # Free the memory used by this object before parsing the next one
    $twig->purge;
}
Output:
$VAR1 = [
"VW-XJC9",
"Name",
undef,
"House",
undef,
undef,
undef,
undef,
"<p>some descrioption of the house</p> "
];
mandatory, What kind of cost, Some text again, different than the first tag
As Borodin said, if you have handlers on name, type and description, and you call $twig->purge at the end of each handler, then the elements are removed from the tree before the localcosts handler can see them. You could instead drop the purge calls from those handlers and set a handler on object that only does a $twig->purge call, and you would be OK.
You don't need to call purge "too often", just make sure you call it at a low enough level so you don't use too much memory. There is no point really in calling it for each single leaf element.
That's a common mistake, one that I make myself quite often ;--(.
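As a rough sketch of purging once per object rather than once per leaf, the example below keeps the leaf handlers purge-free and lets the object handler do the only purge. The inline XML, the ids A1 and B2, and the %current and @report variables are made up for illustration. I have also assumed path-qualified handler names so that the object values are not overwritten by the localcost values.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical two-object sample, much smaller than the real file.
my $xml = <<'XML';
<objects>
  <object>
    <id>A1</id>
    <name>First</name>
    <localcosts>
      <localcost><name>Cleaning</name></localcost>
    </localcosts>
  </object>
  <object>
    <id>B2</id>
    <name>Second</name>
    <localcosts>
      <localcost><name>Tax</name></localcost>
      <localcost><name>Linen</name></localcost>
    </localcosts>
  </object>
</objects>
XML

my %current;    # values collected for the object being parsed
my @report;     # one line per finished object

my $twig = XML::Twig->new(
    twig_handlers => {
        # Leaf handlers only record values; they do not purge.
        'object/id'      => sub { $current{id}   = $_[1]->text },
        'object/name'    => sub { $current{name} = $_[1]->text },
        'localcost/name' => sub { push @{ $current{costs} }, $_[1]->text },
        # One purge per <object>: at most one object is held in memory.
        object => sub {
            my ($t) = @_;
            my $costs = join ', ', @{ $current{costs} || [] };
            push @report, "$current{id}: $current{name} [$costs]";
            %current = ();
            $t->purge;
        },
    },
);
$twig->parse($xml);

print "$_\n" for @report;
```

With the real file you would call parsefile instead of parse; the handler layout stays the same.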