awk: parse and write to another file

Posted 2019-09-21 05:43

Question:

I have records in an XML file like the one below. I need to search for <keyword>SEARCH</keyword> and, if it is present, write the entire record (from <record> to </record>) to another file.

Below is my awk code, which runs inside a loop. $1 holds each line of the record in turn.

if(index($1,"SEARCH")>0)
{
print $1>> "output.txt"
}

This logic has two problems:

  1. It writes only the <keyword>SEARCH</keyword> element to output.txt, not the whole record (from <record> to </record>).
  2. SEARCH can also appear inside a <detail> tag, and this code writes that line to output.txt as well.

XML File:

<record category="xyz">
<person ssn="" e-i="E">
<title xsi:nil="true"/>
<position xsi:nil="true"/>
<names>
<first_name/>
<last_name></last_name>
<aliases>
<alias>CDP</alias>
</aliases>
<keywords>
<keyword xsi:nil="true"/>
<keyword>SEARCH</keyword>
</keywords>
<external_sources>
<uri>http://www.google.com</uri>
<detail>SEARCH is present in abc for xyz reason</detail>
</external_sources>
</details>
</record>
<record category="abc">
<person ssn="" e-i="F">
<title xsi:nil="true"/>
<position xsi:nil="true"/>
<names>
<first_name/>
<last_name></last_name>
<aliases>
<alias>CDP</alias>
</aliases>
<keywords>
<keyword xsi:nil="true"/>
<keyword>DONTSEARCH</keyword>
</keywords>
<external_sources>
<uri>http://www.google.com</uri>
<detail>SEARCH is not present in abc for xyz reason</detail>
</external_sources>
</details>
</record>

Answer 1:

Use GNU awk, which supports a multi-character RS and sets RT to the text that matched it:

$ awk -v RS='</record>\n' '{ORS=RT} /<keyword>SEARCH<\/keyword>/' file 
<record category="xyz">
<person ssn="" e-i="E">
<title xsi:nil="true"/>
<position xsi:nil="true"/>
<names>
<first_name/>
<last_name></last_name>
<aliases>
<alias>CDP</alias>
</aliases>
<keywords>
<keyword xsi:nil="true"/>
<keyword>SEARCH</keyword>
</keywords>
<external_sources>
<uri>http://www.google.com</uri>
<detail>SEARCH is present in abc for xyz reason</detail>
</external_sources>
</details>
</record>

If you need to match any of several keywords, list them as alternatives:

$ awk -v RS='</record>\n' '{ORS=RT} /<keyword>(SEARCH1|SEARCH2|SEARCH3)<\/keyword>/' file 
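A multi-character RS and the RT variable are gawk extensions, so the one-liners above may behave differently under other awks. Here is a minimal portable sketch. It assumes every record ends with </record> on a line of its own, as in the sample. The names kw, buf and sample.xml are made up for illustration, and the sample data is a shortened version of the question's XML.

```shell
# Build a small sample file (hypothetical name) with two records.
cat > sample.xml <<'EOF'
<record category="xyz">
<keywords>
<keyword>SEARCH</keyword>
</keywords>
<detail>SEARCH is present in abc for xyz reason</detail>
</record>
<record category="abc">
<keywords>
<keyword>DONTSEARCH</keyword>
</keywords>
<detail>SEARCH is not present in abc for xyz reason</detail>
</record>
EOF

# Buffer each record line by line. At </record>, print the buffer only if
# it contains the exact element <keyword>kw</keyword>. index() does a plain
# substring test, so kw is not interpreted as a regular expression.
awk -v kw='SEARCH' '
    { buf = buf $0 "\n" }
    /<\/record>/ {
        if (index(buf, "<keyword>" kw "</keyword>")) printf "%s", buf
        buf = ""
    }
' sample.xml > output.txt
```

Because the keyword is passed with -v, the same script can be reused for other keywords without editing the program text. The <detail> line of the second record mentions SEARCH but does not trigger a match, since the test looks for the whole <keyword> element.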


Answer 2:

$ cat x.awk
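# Start buffering at the opening tag of each record.
/<record / { i=1 }
# While inside a record, store every line.
i { a[i++]=$0 }
# Remember that this record contains the exact keyword element.
/<keyword>SEARCH<\/keyword>/ { found=1 }
# At the closing tag, write out the buffered lines if the keyword was seen.
/<\/record>/ {
    if (found) {
        n = i - 1    # lines stored for this record only, not stale ones
        for (j=1; j<=n; ++j) print a[j] > "output.txt"
    }
    i=0
    found=0
}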


$ awk -f x.awk x.xml

$ cat output.txt
<record category="xyz">
<person ssn="" e-i="E">
<title xsi:nil="true"/>
<position xsi:nil="true"/>
<names>
<first_name/>
<last_name></last_name>
<aliases>
<alias>CDP</alias>
</aliases>
<keywords>
<keyword xsi:nil="true"/>
<keyword>SEARCH</keyword>
</keywords>
<external_sources>
<uri>http://www.google.com</uri>
<detail>SEARCH is present in abc for xyz reason</detail>
</external_sources>
</details>
</record>


Answer 3:

You seem to have cross-posted this question from Unix & Linux, so I'll give the same answer here as I did there:

I'm going to assume that what you've posted is a sample, because it isn't valid XML (for example, <names> and <person> are never closed, and </details> has no opening tag). If this assumption isn't valid, my answer doesn't hold... but if that is the case, you really need to hit the person who gave you the XML with a rolled up copy of the XML spec, and demand they 'fix it'.

But really - awk and regular expressions are not the right tool for the job. An XML parser is. And with a parser, it's absurdly simple to do what you want:

#!/usr/bin/env perl

use strict;
use warnings;

use XML::Twig; 

#parse your file - this will error if it's invalid. 
my $twig = XML::Twig -> new -> parsefile ( 'your_xml' );
#set output format. Optional. 
$twig -> set_pretty_print('indented_a');

#iterate all the 'record' nodes off the root. 
foreach my $record ( $twig -> get_xpath ( './record' ) ) {
   #if - beneath this record - we have a node anywhere (that's what // means)
   #with a tag of 'keyword' and content of 'SEARCH' 
   #print the whole record. 
   if ( $record -> get_xpath ( './/keyword[string()="SEARCH"]' ) ) {
       $record -> print;
   }
}

XPath is quite a lot like regular expressions in some ways, but it's more like a directory path. That means it's context-aware and can handle XML structures.

In the above, ./ means 'directly below the current node', so:

$twig -> get_xpath ( './record' )

Means any 'top level' <record> tags.

But .// means "at any level, below current node" so it'll do it recursively.

$twig -> get_xpath ( './/search' ) 

Would get any <search> nodes at any level.

And the square brackets denote a condition: that's either a function (e.g. text() to get the text of the node, or string() as used above) or an attribute test. For example, //category[@name] would find any category with a name attribute, and //category[@name="xyz"] would filter those further.

XML used for testing:

<XML>
<record category="xyz">
<person ssn="" e-i="E">
<title xsi:nil="true"/>
<position xsi:nil="true"/>
<details>
<names>
<first_name/>
<last_name></last_name>
</names>
<aliases>
<alias>CDP</alias>
</aliases>
<keywords>
<keyword xsi:nil="true"/>
<keyword>SEARCH</keyword>
</keywords>
<external_sources>
<uri>http://www.google.com</uri>
<detail>SEARCH is present in abc for xyz reason</detail>
</external_sources>
</details>
</person>
</record>
<record category="abc">
<person ssn="" e-i="F">
<title xsi:nil="true"/>
<position xsi:nil="true"/>
<details>
<names>
<first_name/>
<last_name></last_name>
</names>
<aliases>
<alias>CDP</alias>
</aliases>
<keywords>
<keyword xsi:nil="true"/>
<keyword>DONTSEARCH</keyword>
</keywords>
<external_sources>
<uri>http://www.google.com</uri>
<detail>SEARCH is not present in abc for xyz reason</detail>
</external_sources>
</details>
</person>
</record>
</XML>

Output:

 <record category="xyz">
    <person
        e-i="E"
        ssn="">
      <title xsi:nil="true" />
      <position xsi:nil="true" />
      <details>
        <names>
          <first_name/>
          <last_name></last_name>
        </names>
        <aliases>
          <alias>CDP</alias>
        </aliases>
        <keywords>
          <keyword xsi:nil="true" />
          <keyword>SEARCH</keyword>
        </keywords>
        <external_sources>
          <uri>http://www.google.com</uri>
          <detail>SEARCH is present in abc for xyz reason</detail>
        </external_sources>
      </details>
    </person>
  </record>

Note that the above just prints each matching record to STDOUT. In my opinion that's not such a great idea, not least because it drops the enclosing XML structure, so the output isn't well-formed XML if more than one record matches (there's no root node).

So, to accomplish exactly what you're asking, I would instead do this:

#!/usr/bin/env perl

use strict;
use warnings;

use XML::Twig; 

my $twig = XML::Twig -> new -> parsefile ('your_file.xml'); 
$twig -> set_pretty_print('indented_a');

foreach my $record ( $twig -> get_xpath ( './record' ) ) {
   if ( not $record -> findnodes ( './/keyword[string()="SEARCH"]' ) ) {
       $record -> delete;
   }
}

open ( my $output, '>', "output.txt" ) or die $!;
print {$output} $twig -> sprint;
close ( $output ); 

This inverts the logic: it deletes (from the parsed data structure in memory) the records you don't want, then prints the whole remaining structure (including the XML headers) to a new file called "output.txt".