Remove XML comments using Regex in bash

Posted 2020-03-05 04:50

I want to remove XML comments in bash using regex (awk, sed, grep...). I have looked at other questions about this, but they are all missing something. Here's my XML:

<Table>
    <!--
   to be removed bla bla bla bla bla bl............

    removeee

    to be removeddddd
    -->

<row>
        <column name="example"  value="1" ></column>
    </row>
</Table>

I'm comparing two XML files, but I don't want the comparison to take the comments into account. I do this:

diff file1.xml file2.xml | sed '/<!--/,/-->/d'

But that only removes the line that starts with <!-- and the last line of the comment. It does not remove the lines in between.

Tags: xml regex bash
4 answers
淡お忘
#2 · 2020-03-05 05:22

The simplest solution I could come up with to remove all comments from a text file is:

sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' | grep -zv '^<!--' | tr -d '\0'

To explain:

The sed command puts a null character before every <!-- and after every -->, like this:

<Table>
    \0<!--
   to be removed bla bla bla bla bla bl............

    removeee

    to be removeddddd
    -->\0

<row>
        <column name="example"  value="1" ></column>
    </row>
</Table>

Then grep -z treats that character as the "line separator", so it sees these records:

  • <Table>\n
  • <!--\n to be removed bla bla bla bla bla bl............\n\n removeee\n\n to be removeddddd\n -->
  • \n\n<row>\n <column name="example" value="1" ></column>\n </row>\n</Table>\n

grep -v will remove the middle record, because it is the only one that starts with <!--.

And finally tr -d will remove the \0 again.
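
To check the record view described above, here is a minimal sketch that counts the NUL-delimited records instead of printing them. It assumes GNU sed (for the \x0 escape) and GNU grep (for -z). The sample string and the mark_comments helper are made up for illustration and are not part of the original answer.

```shell
# Shortened version of the sample document from the question.
sample='<Table>
    <!--
    to be removed
    -->
<row/>
</Table>'

# Hypothetical helper: emit the sample with NUL markers around each comment.
# The \x0 escape is a GNU sed extension.
mark_comments() {
  printf '%s\n' "$sample" | sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g'
}

# With -z, grep counts NUL-delimited records instead of newline-delimited lines.
comment_records=$(mark_comments | grep -zc '^<!--')
kept_records=$(mark_comments | grep -zcv '^<!--')

echo "comment records: $comment_records"
echo "kept records: $kept_records"
```

On this sample it should report one comment record and two kept records, which matches the three bullets above.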


In this case it should be applied to both files before comparing, e.g.:

diff <(sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' file1.xml | grep -zv '^<!--' | tr -d '\0') <(sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' file2.xml | grep -zv '^<!--' | tr -d '\0')

or more readable with a function:

stripcomments() { cat "$@" | sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' | grep -zv '^<!--' | tr -d '\0'; }

diff <(stripcomments file1.xml) <(stripcomments file2.xml)
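
If process substitution is not available (plain sh, for example), the same comparison can be sketched with temporary files. This is only an illustration under the same GNU sed/grep assumption. The file contents, the workdir directory and the verdict variable are invented for the demo.

```shell
# Same pipeline as above, written as a function in POSIX-compatible syntax.
stripcomments() {
  cat "$@" | sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' | grep -zv '^<!--' | tr -d '\0'
}

# Two throwaway files that differ only in their comments.
workdir=$(mktemp -d)
printf '%s\n' '<Table>' '<!-- old note' 'on two lines -->' '<row/>' '</Table>' > "$workdir/file1.xml"
printf '%s\n' '<Table>' '<!-- new note' 'spread over' 'three lines -->' '<row/>' '</Table>' > "$workdir/file2.xml"

# Strip the comments into temporary copies, then compare the copies.
stripcomments "$workdir/file1.xml" > "$workdir/file1.stripped"
stripcomments "$workdir/file2.xml" > "$workdir/file2.stripped"

if diff "$workdir/file1.stripped" "$workdir/file2.stripped" > /dev/null; then
  verdict="identical apart from comments"
else
  verdict="real differences found"
fi
echo "$verdict"

rm -rf "$workdir"
```

Here diff should find nothing, even though the two comments have a different number of lines.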

In theory there might be some issues with CDATA sections, since they can contain unbalanced comment markers, and they are more likely to contain null characters that matter, but I have never seen such an XML file in real life.

So for most valid XML files this should work.
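
To make the CDATA caveat concrete, here is a small sketch of an input that trips the pipeline. The document is invented for the demo. It has a CDATA section containing a lone <!--, followed by a real comment further down. GNU sed and GNU grep are assumed, as before.

```shell
# A CDATA section with an unbalanced "<!--", then a genuine comment.
tricky='<a><![CDATA[ <!-- ]]></a>
<b>must survive</b>
<!-- real comment -->
<c/>'

damaged=$(printf '%s\n' "$tricky" \
  | sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' \
  | grep -zv '^<!--' \
  | tr -d '\0')

printf '%s\n' "$damaged"

# Report whether the element between the two "<!--" markers survived.
case "$damaged" in
  *'<b>must survive</b>'*) echo "element <b> kept" ;;
  *) echo "element <b> was deleted along with the comment" ;;
esac
```

The record that starts at the <!-- inside the CDATA section runs up to the next marker, so the <b> element is dropped with it. For input like this, a real XML parser is the safer choice.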

放我归山
#3 · 2020-03-05 05:32

You can combine perl and xmllint to get this job done:

cat yourFile.xml | perl -e 'while (<>) { next if (/Start.*End/ );if (/Start/) { while (<>) {last if (/End/) }}else {print "$_"; }} ' | xmllint --format -

Here Start is your comment opener (in our case <!--) and End is your comment closer (in our case -->).

I tried to use grep -vP without any good results, because I could not find a way to tell grep to let the dot match newlines (the s modifier).

在下西门庆
#4 · 2020-03-05 05:34
xmlstarlet ed -d '//comment()' file.xml
三岁会撩人
#5 · 2020-03-05 05:40

In the end, you're going to have to recommend to your client/friend/instructor that they need to install some kind of XML processor. xmlstarlet is a good command line tool, but there are any number (or at least some number greater than 2) of implementations of XSLT which can be compiled for any standard Unix, and in most cases also for Windows. You really cannot do much XML processing with regex-based tools, and whatever you do will be hard to read, harder to maintain, and likely to fail on corner cases, sometimes with disastrous consequences.

I haven't spent a lot of time polishing or reviewing the following little awk program. I think it will remove comments from compliant xml documents. Note that the following comment is not compliant:

<!-- XML comments cannot include -- so this comment is illegal -->

and it will not be treated correctly by my script.

The following is also illegal, but since I've seen it in the wild and it wasn't hard to deal with, I did so:

<!-------------- This comment is ill-formed but... -------------->

Here it is. No guarantees. I know that it's hard to read, and I wouldn't want to maintain it. It may well fail on arbitrary corner cases.

awk 'in_comment&&/-->/{sub(/([^-]|-[^-])*--+>/,"");in_comment=0}
     in_comment{next}
     {gsub(/<!--+([^-]|-[^-])*--+>/,"");
      in_comment=sub(/<!--+.*/,"");
      print}'
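
A quick way to try the awk program is to feed it a sample that mixes the cases discussed above: a multi-line comment, an inline comment, and the dash-heavy variant. The sample text and the cleaned variable are invented for this sketch. The awk program itself is unchanged.

```shell
sample='<Table>
    <!--
    to be removed
    -->
<row><!-- inline --><column name="example" value="1"></column></row>
<!-------------- ill-formed -------------->
</Table>'

# Run the awk program from above on the sample and capture the result.
cleaned=$(printf '%s\n' "$sample" | awk 'in_comment&&/-->/{sub(/([^-]|-[^-])*--+>/,"");in_comment=0}
     in_comment{next}
     {gsub(/<!--+([^-]|-[^-])*--+>/,"");
      in_comment=sub(/<!--+.*/,"");
      print}')

printf '%s\n' "$cleaned"
```

The lines that held comments are left behind as empty or whitespace-only lines, so diff -B or a formatting pass may still be needed before comparing.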