Extract data from XML using ksh script

Posted 2019-06-12 00:42

The first question I asked on this topic was closed because of lack of info. So asking this again with some more details added.

I have to extract the value of one tag from an XML file, and I have to do it in ksh. (I could solve this in perl, but it has to be ksh, and I cannot use third-party tools like xmlsh.)

sample.xml

<?xml version="1.0" standalone="yes" ?>
<parent_one>
  <parent_two>
    <Pool>
      <pool_name>ABC</pool_name>
      <percent_full>79</percent_full>
      <pool_state>Enabled</pool_state>
    </Pool>
    <Pool>
      <pool_name>DEF</pool_name>
      <percent_full>40</percent_full>
      <pool_state>Enabled</pool_state>
    </Pool>
    <Pool>
      <pool_name>XYZ</pool_name>
      <percent_full>40</percent_full>
      <pool_state>Disabled</pool_state>
    </Pool> 
    <Totals>
      <total_tracks>4546456</total_tracks>
      <percent_full>48</percent_full>
    </Totals>
  </parent_two>
</parent_one>

The ksh script should read sample.xml and print ABC and DEF from the pool_name tags, because the corresponding pool_state is Enabled. It should not print XYZ, because its pool_state is Disabled.

Expected output:

ABC
DEF

Is this feasible in ksh or do I have to use perl for this?

Tags: xml parsing ksh
3 answers
我命由我不由天
#2 · 2019-06-12 01:16

The sane solution to this problem is to call out to xmllint --xpath, xqilla -p, or your favorite Python/Ruby/Perl etc. XML library.

Otherwise you can have a look at Roland Mainz's XML examples and extend them for your purposes.

If you were really serious about this you would probably want to look into writing bindings for libxml2 for ksh. I don't think anybody has done this yet.

贼婆χ
#3 · 2019-06-12 01:25

I've done quite a lot of parsing of odd format files with (n)awk. Technically, this could be done with just ksh, but awk (and perl) are easier...

The following sample uses awk's range pattern (start,end), which runs the action only on the lines from one matching the start pattern through one matching the end pattern (in this case <Pool> and </Pool>).

Other than that it's straightforward, using variables mimicking the xml elements for clarity.

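awk '/<Pool>/,/<\/Pool>/ {
    # Remember whether this Pool is enabled (1) or not (0).
    if (/<pool_state>/) {
        pool_state = (/<pool_state>Enabled<\/pool_state>/)
    }
    # Strip the tags from a copy of the line, keeping only the name.
    if (/<pool_name>/) {
        name = $0
        if ( gsub(/.*<pool_name>|<\/pool_name>.*/, "", name) ) {
            pool_name = name
        }
    }
    # End of block: print the name if enabled, then reset both variables
    # (awk has no "unset" statement, so assign empty values instead).
    if (/<\/Pool>/) {
        if (pool_name != "" && pool_state)
            print pool_name
        pool_name = ""
        pool_state = 0
    }
}' sample.xml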

This code will fail horribly when the xml is malformed, when multiple Pool elements are listed on a single line, etc.
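
For comparison, here is a rough sketch of the same line-by-line idea in plain ksh, using only read, case and parameter expansion. The function name enabled_pools is made up for illustration, and it assumes, like the awk version, that every element sits on its own line as in sample.xml. It is pattern matching, not XML parsing.

```shell
# Hypothetical helper: print the name of every enabled pool read from stdin.
# Assumes one element per line, as in sample.xml.
enabled_pools() {
    name=
    state=
    while IFS= read -r line; do
        case $line in
            *"<Pool>"*)
                # New block: forget values left over from the previous one.
                name=
                state=
                ;;
            *"<pool_name>"*"</pool_name>"*)
                name=${line#*"<pool_name>"}       # drop everything up to the open tag
                name=${name%%"</pool_name>"*}     # drop the close tag and the rest
                ;;
            *"<pool_state>"*"</pool_state>"*)
                state=${line#*"<pool_state>"}
                state=${state%%"</pool_state>"*}
                ;;
            *"</Pool>"*)
                # End of block: print the name only if the pool is enabled.
                if [ "$state" = Enabled ] && [ -n "$name" ]; then
                    printf '%s\n' "$name"
                fi
                ;;
        esac
    done
}
```

Called as enabled_pools < sample.xml, this should print ABC and DEF for the sample in the question. It shares every weakness listed above.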

▲ chillily
#4 · 2019-06-12 01:26

That being said (my comment about trying to parse XML without a proper XML parser), let's give it a shot using sed/awk, not pure ksh. Take this answer as the foundation, remove all <Pool></Pool> blocks which have pool_state set to Disabled, then get the lines containing pool_name and capture the value between the tags. If your xml file looks like your sample this should work, but will definitely break if it doesn't.

awk '
    /<Pool>/ { rec=""; f=1 }
    f {rec = rec $0 ORS}
    /<\/Pool>/ {
        if (f && (rec !~ "<pool_state>Disabled</pool_state>"))
            printf "%s", rec
        f=0
    }' sample.xml |
grep pool_name |
sed 's#.*>\([^<]*\)<.*#\1#g'

You could fit the whole thing into one awk script, but I figured this might be easier to follow (OK, I am being lazy).
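
A sketch of what that single awk script might look like, keeping the same buffer-then-filter idea but pulling the name out with match() and substr() instead of piping to grep and sed. The wrapper function name enabled_pools_awk is made up for illustration, and the same caveat applies: it only works if the file looks like the sample.

```shell
# Hypothetical wrapper: print pool_name for every Pool block that is not Disabled.
# Takes the XML file name(s) as arguments.
enabled_pools_awk() {
    awk '
        /<Pool>/ { rec = ""; f = 1 }      # start buffering a new block
        f        { rec = rec $0 ORS }     # append each line while inside a block
        /<\/Pool>/ {
            if (f && rec !~ /<pool_state>Disabled<\/pool_state>/) {
                # Locate <pool_name>...</pool_name> inside the buffered block.
                if (match(rec, /<pool_name>[^<]*<\/pool_name>/)) {
                    # 11 is the length of "<pool_name>"; 23 is that plus
                    # the length of "</pool_name>" (12).
                    print substr(rec, RSTART + 11, RLENGTH - 23)
                }
            }
            f = 0
        }' "$@"
}
```

It would be run as enabled_pools_awk sample.xml.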
