I need to get all text between <Annotation> and </Annotation> for every element in which the word MATCH occurs. How can I do this in Vim?
<Annotation about="MATCH UNTIL </Annotation> " timestamp="0x000463e92263dd4a" href=" 5raS5maS90ZWh0YXZha29rb2VsbWEvbGFza2FyaS8QyrqPk5L9mAI">
<Label name="las" />
<Label name="_cse_6sbbohxmd_c" />
<AdditionalData attribute="original_url" value="MATCH UNTIL </Annotation> " />
</Annotation>
<Annotation about="NO MATCH" href=" Cjl3aWtpLmhlbHNpbmtpLmZpL2Rpc3BsYXkvbWF0aHN0YXRLdXJzc2l0L0thaWtraStrdXJzc2l0LyoQh_HGoJH9mAI">
<Label name="_cse_6sbbohxmd_c" />
<Label name="courses" />
<Label name="kurssit" />
<AdditionalData attribute="original_url" value="NO MATCH" />
</Annotation>
<Annotation about="MATCH UNTIL </ANNOTATION> " score="1" timestamp="0x000463e90f8eed5c" href="CiZtYXRoc3RhdC5oZWx zaW5raS5maS90ZWh0YXZha29rb2VsbWEvKhDc2rv8kP2YAg">
<Label name="_cse_6sbbohxmd_c" />
<Label name="exercises_without_solutions" />
<Label name="tehtäväkokoelma" />
<AdditionalData attribute="original_url" value="MATCH UNTIL </ANNOTATION>" />
</Annotation>
Does it have to be done within vim? Could you cheat, and open a second window where you pipe something into more/less that tells you what line number to go to within vim?
-- edit --
I have never done a multi-line match/search in vi[m]. However, to cheat in another window:
perl -n -e 'if ( /<tag/ .. /<\/tag/)' -e '{ print "$.:$_"; }' file.xml | less
will show the elements/blocks for "tag" (or other longer matching names), with line numbers, in less, and you can then search for the other text within each block.
Close enough?
-- edit --
Within less, type
/MATCH
to search for occurrences of MATCH. The line number of each occurrence (inside the targeted element/tags) is shown in the left margin.
Then, within vi[m], type
:n
where "n" is the desired line number.
Of course, if what you really wanted to do was some kind of search/yank/replace, it's more complicated. At that point, awk / perl / ruby (or something similar which meets your tastes ... or xsl?) is really the tool you should be using for the transformation.
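If a script is acceptable, the same idea can go one step further and report only the blocks that actually contain the word. What follows is a rough Python sketch, not a polished tool: the function name blocks_containing and the sample data are made up for illustration, and it assumes that start-tags and end-tags sit on their own lines and that no attribute value contains the text </Annotation>.

```python
import re

def blocks_containing(lines, tag, needle):
    """Yield (start_line, block_lines) for each <tag>...</tag> block containing needle."""
    start_re = re.compile(r"<" + re.escape(tag) + r"\b")
    end_re = re.compile(r"</" + re.escape(tag) + r"\s*>")
    block, start = None, 0
    for lineno, line in enumerate(lines, start=1):
        # Open a new block when a start-tag is seen outside any block.
        if block is None and start_re.search(line):
            block, start = [], lineno
        if block is not None:
            block.append(line)
            # Close the block on the end-tag and report it if it mentions needle.
            if end_re.search(line):
                if any(needle in l for l in block):
                    yield start, block
                block = None

# Hypothetical sample data, simplified from the question.
sample = """\
<Annotation about="first">
<AdditionalData attribute="original_url" value="MATCH one" />
</Annotation>
<Annotation about="second">
<Label name="courses" />
</Annotation>
<Annotation about="MATCH two">
<Label name="exercises" />
</Annotation>
""".splitlines()

found = list(blocks_containing(sample, "Annotation", "MATCH"))
for start, block in found:
    print(f"{start}: {len(block)} lines")
```

Each reported start line can be fed to :n in vi[m] as described above.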
First, a disclaimer: Any attempt to slice and dice XML with regular expressions is fragile; a real XML parser would do better.
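For comparison, here is roughly what the parser route could look like, sketched with Python's standard xml.etree.ElementTree module. The wrapper element Annotations, the helper name mentions, and the sample data are assumptions for illustration. Note that a strict parser will likely reject the data exactly as posted in the question, because a literal < inside an attribute value is not well-formed XML (it would have to be written as &lt;).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a single wrapper root element is assumed.
doc = """\
<Annotations>
  <Annotation about="MATCH one" score="1">
    <Label name="las" />
    <AdditionalData attribute="original_url" value="MATCH one" />
  </Annotation>
  <Annotation about="other">
    <Label name="courses" />
    <AdditionalData attribute="original_url" value="other" />
  </Annotation>
</Annotations>
"""

def mentions(element, needle):
    """True if needle appears in an attribute value or text of element or a descendant."""
    for node in element.iter():  # iter() visits the element itself first
        if any(needle in value for value in node.attrib.values()):
            return True
        if node.text and needle in node.text:
            return True
    return False

root = ET.fromstring(doc)
hits = [a for a in root.iter("Annotation") if mentions(a, "MATCH")]
for a in hits:
    print(ET.tostring(a, encoding="unicode"))
```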
The pattern:
\(<Annotation\(\s*\w\+="[^"]\{-}"\s\{-}\)*>\)\@<=\(\(<\/Annotation\)\@!\_.\)\{-}"MATCH\_.\{-}\(<\/Annotation>\)\@=
Let's break it down...
Group 1 is <Annotation\(\s*\w\+="[^"]\{-}"\s\{-}\)*> and it matches the start-tag of the Annotation element. Group 2, which is nested inside Group 1, matches an attribute and may be repeated zero or more times.
Group 2 is \s*\w\+="[^"]\{-}"\s\{-} and most of its pieces are commonly used; the most unusual is \{-}, which means non-greedy repetition (*? in Perl-compatible regular expressions). The non-greedy whitespace match at the end is important for performance; with a greedy \s* there, Vim would try every possible way to split the whitespace between attributes between the \s* at the end of Group 2 and the \s* at the beginning of the next occurrence of Group 2.
Group 1 is followed by \@<= which is a zero-width positive look-behind. It prevents the start-tag from being included in the matched text (e.g., for s///).
Group 3 is \(<\/Annotation\)\@!\_. and it includes Group 4, which matches the beginning of the Annotation end-tag. The \@! is a zero-width negative look-ahead and \_. matches any character (including newlines). Together, this group matches any character except one where the Annotation end-tag starts. Group 3 is followed by a non-greedy repetition marker \{-} so that it matches the smallest block of text before MATCH. If you were to use a plain \_. instead of Group 3, the matched text could include the end-tag of an Annotation element that did not include MATCH and continue through into the next Annotation element with MATCH. (Try it.)
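For anyone who would rather "try it" outside Vim, here is a small Python sketch of the same comparison. It is only an illustration: the start-tag is simplified to <Annotation[^>]*> (an assumption that no attribute value contains >), the sample text is made up, and the body is captured with a group instead of a look-behind.

```python
import re

# Made-up sample: the first element has no "MATCH value, the second does.
text = (
    '<Annotation about="NO MATCH">\n'
    '<Label name="courses" />\n'
    '</Annotation>\n'
    '<Annotation about="second">\n'
    '<AdditionalData value="MATCH here" />\n'
    '</Annotation>\n'
)

START = r'<Annotation[^>]*>'  # simplified start-tag, an assumption for brevity

# Plain lazy dot: nothing stops it from running past the first end-tag.
plain = re.compile(START + r'(.*?"MATCH.*?)(?=</Annotation>)', re.S)

# Guarded dot: a character is consumed only if no end-tag begins at it.
tempered = re.compile(
    START + r'((?:(?!</Annotation).)*?"MATCH.*?)(?=</Annotation>)', re.S
)

plain_body = plain.search(text).group(1)
tempered_body = tempered.search(text).group(1)

print('</Annotation>' in plain_body)     # did the match leak into the next element?
print('</Annotation>' in tempered_body)
```

With the plain version the captured text starts in the first element and ends in the second; with the guarded version only the second element's body is captured.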
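The next bit is straightforward: Find "MATCH and a minimal number of other characters before the end-tag. Because the pattern includes the leading double quote, the word has to start an attribute value, so a value such as "NO MATCH" is skipped.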
Group 5 is easy: it's the end-tag. \@= is a zero-width positive look-ahead, which is included here for the same reason as the \@<= for the start-tag. We have to repeat <\/Annotation rather than use \4 because groups with zero-width modifiers aren't captured.
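If you want to experiment with the pattern outside Vim, here is a rough port to Python's re module. Treat it as a sketch rather than an exact equivalent. The mapping used is: \{-} becomes *?, \_. becomes . with re.S, \@! becomes (?!...), and \@= becomes (?=...). Python's re only supports fixed-width look-behind, so the start-tag is consumed and the body is captured in a group instead of using \@<=. The attribute part is also written as \s+name="value" so the whitespace can only be split one way. The name ANNOTATION_BODY and the sample text are made up.

```python
import re

ANNOTATION_BODY = re.compile(
    r'''
    <Annotation(?:\s+\w+="[^"]*")*\s*>   # start-tag with zero or more attributes
    (                                    # capture the body (stands in for the look-behind)
      (?:(?!</Annotation).)*?            # any character that does not begin the end-tag
      "MATCH                             # the word, at the start of an attribute value
      .*?                                # as little as possible after it
    )
    (?=</Annotation>)                    # stop just before the end-tag
    ''',
    re.S | re.X,
)

# Made-up sample data, simplified from the question.
text = (
    '<Annotation about="first" score="1">\n'
    '<AdditionalData attribute="original_url" value="MATCH one" />\n'
    '</Annotation>\n'
    '<Annotation about="NO MATCH">\n'
    '<Label name="courses" />\n'
    '</Annotation>\n'
    '<Annotation about="third">\n'
    '<AdditionalData attribute="original_url" value="MATCH two" />\n'
    '</Annotation>\n'
)

bodies = [m.group(1) for m in ANNOTATION_BODY.finditer(text)]
for body in bodies:
    print(repr(body))
```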