Grep for multiple patterns in a file

Posted 2019-09-12 00:01

Question:

I'd like to count the number of XML nodes in my XML file (with grep or some other tool).

....
<countryCode>GBR</countryCode>
<countryCode>USA</countryCode>
<countryCode>CAN</countryCode>
...
<countryCode>CAN</countryCode>
<someNode>USA</someNode>
<countryCode>CAN</countryCode>
<someNode>Otherone</someNode>
<countryCode>GBR</countryCode>
...

How can I get a count for each individual country, like CAN = 3, USA = 1, GBR = 2, without passing in the names of the countries, since there might be more countries?

Update:

There are other nodes besides countryCode.

Answer 1:

My simple suggestion would be to use sort and uniq -c:

$ echo '<countryCode>GBR</countryCode>
<countryCode>USA</countryCode>
<countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
<countryCode>GBR</countryCode>' | sort | uniq -c
      3 <countryCode>CAN</countryCode>
      2 <countryCode>GBR</countryCode>
      1 <countryCode>USA</countryCode>

Here you'd pipe in the output of your grep instead of the echo. A more robust solution would be to use XPath. If your XML file looks like

<countries>
  <countryCode>GBR</countryCode>
  <countryCode>USA</countryCode>
  <countryCode>CAN</countryCode>
  <countryCode>CAN</countryCode>
  <countryCode>CAN</countryCode>
  <countryCode>GBR</countryCode>
</countries>

Then you could use:

$ xpath -q -e '/countries/countryCode/text()'  countries.xml  | sort | uniq -c
      3 CAN
      2 GBR
      1 USA

I say it's more robust because tools designed for parsing flat text are inherently flaky when dealing with XML. Depending on the structure of the original XML file, a different XPath query that matches the elements anywhere in the document might work better:

$ xpath -q -e '//countryCode/text()'  countries.xml  | sort | uniq -c
      3 CAN
      2 GBR
      1 USA


Answer 2:

grep can give a total count, but it doesn't count per pattern; for that you should use uniq -c:

$ uniq -c <(sort file)
  1 
  1  
  3 <countryCode>CAN</countryCode>
  2 <countryCode>GBR</countryCode>
  1 <countryCode>USA</countryCode>

If you want to get rid of the empty lines and tags, add sed:

$ sed -e '/^[[:space:]]*$/d' -e 's/<.*>\([A-Z]*\)<.*>/\1/g' file | sort | uniq -c
  3 CAN
  2 GBR
  1 USA

To delete lines that don't have a country code, add another command to sed:

$ sed -e '/countryCode/!d' -e '/^[[:space:]]*$/d' -e 's/<.*>\([A-Z]*\)<.*>/\1/g' file | sort | uniq -c
  3 CAN
  2 GBR
  1 USA


Answer 3:

Quick and dirty (based only on your example text):

awk -F'>|<' '{a[$3]++;}END{for(x in a)print x,a[x]}' file

test:

kent$  cat t.txt
<countryCode>GBR</countryCode>
<countryCode>USA</countryCode>
<countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
<countryCode>GBR</countryCode>

kent$  awk -F'>|<' '{a[$3]++;}END{for(x in a)print x,a[x]}' t.txt 
USA 1
GBR 2
CAN 3
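
Since the question's update says there are other nodes besides countryCode, the one-liner above would count those too. A minimal sketch that only counts countryCode lines follows. The function name count_country_codes is made up for illustration, and it still assumes one element per line without attributes, as in the sample text.

```shell
# Hypothetical helper: count <countryCode> values and ignore every other node.
# Assumes one element per line, as in the sample from the question.
count_country_codes() {
  awk -F'[<>]' '
    # After splitting on < and >, field 2 is the tag name and field 3 is the text.
    $2 == "countryCode" { count[$3]++ }
    END { for (code in count) print code, count[code] }
  ' "$@" | sort   # for-in order is unspecified in awk, so sort for stable output
}

# Usage with a few sample lines on stdin; a file name argument works as well.
printf '%s\n' \
  '<countryCode>GBR</countryCode>' \
  '<someNode>USA</someNode>' \
  '<countryCode>CAN</countryCode>' \
  '<countryCode>CAN</countryCode>' | count_country_codes
# prints:
# CAN 2
# GBR 1
```

Matching on the tag name in field 2 also tolerates leading indentation, because the whitespace ends up in field 1.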


Answer 4:

sed -n "s/<countryCode>\(.*\)<\/countryCode>/\1/p" file.xml | sort | uniq -c


Answer 5:

cat dummy | sort |cut -c14-16 | sort |tail -6 |awk  '{col[$1]++} END {for (i in col) print i, col[i]}'

Here dummy is your file name. Replace the 6 in -6 with n-2, where n is the number of lines in your data file.



Answer 6:

Something like this maybe:

grep -e 'regex' file.xml | sort | uniq -c

Of course, you need to provide a regex that matches your needs.
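
As one possible regex, assuming the elements carry no attributes and your grep supports -o (GNU and BSD grep do, but it is not in POSIX), something like this sketch could work. The helper name count_codes_grep is hypothetical.

```shell
# Hypothetical helper: read XML on stdin and print "count code" pairs.
# Assumes grep supports -o and that <countryCode> has no attributes.
count_codes_grep() {
  grep -o '<countryCode>[^<]*</countryCode>' |   # keep only the matching elements
    sed -e 's/<[^>]*>//g' |                      # strip the opening and closing tags
    sort | uniq -c |
    awk '{ print $1, $2 }'                       # drop the padding uniq puts before the count
}

# Usage: two elements on one line are both counted, other nodes are ignored.
printf '%s\n' \
  '<countryCode>GBR</countryCode><countryCode>CAN</countryCode>' \
  '<someNode>USA</someNode>' \
  '<countryCode>CAN</countryCode>' | count_codes_grep
# prints:
# 2 CAN
# 1 GBR
```

With a real file you would call it as count_codes_grep < file.xml.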



Answer 7:

If your file is set up as you showed us, awk can do it like this:

awk -F '<\/?countryCode>' '{ a[$2]++ } END { for (e in a) { printf("%s\t%i\n", e, a[e]) } }' INPUTFILE

If there is more than one <countryCode> tag on a line, you can still set up a pipe that first puts each tag on its own line, e.g.:

sed 's/<countryCode>/\n<countryCode>/g' INPUTFILE | awk ...

Note that if a <countryCode> element spans multiple lines, this does not work as expected.
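
Put together, the split-then-count idea might look like the sketch below. It assumes GNU sed (other seds may not treat \n in the replacement as a newline), the function name split_and_count is made up, and the NF > 1 guard is my addition so that lines without a countryCode tag are not counted under an empty key.

```shell
# Hypothetical helper: put each <countryCode> on its own line, then count the values.
# Assumes GNU sed, which expands \n in the replacement to a newline.
split_and_count() {
  sed 's/<countryCode>/\n<countryCode>/g' "$@" |
    awk -F '</?countryCode>' '
      # Lines containing a countryCode tag split into more than one field;
      # field 2 is then the text between the opening and closing tag.
      NF > 1 { a[$2]++ }
      END { for (e in a) printf("%s\t%d\n", e, a[e]) }
    ' | sort   # awk for-in order is unspecified, so sort for stable output
}

# Usage: reads the named files, or stdin when no file is given.
printf '%s\n' \
  '<countryCode>GBR</countryCode><countryCode>CAN</countryCode>' \
  '<someNode>USA</someNode>' \
  '<countryCode>CAN</countryCode>' | split_and_count
```

The caveat above still applies: an element that spans several lines is not handled.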

Anyway, I'd recommend using xpath for this kind of task (Perl's XML::XPath module has a CLI utility for this).



Answer 8:

Quick and simple:

grep countryCode ./file.xml | sort | uniq -c



Tags: linux shell unix