Vim help says that:
\1 Matches the same string that was matched by the first sub-expression in \( and \). {not in Vi} Example: "\([a-z]\).\1" matches "ata", "ehe", "tot", etc.
It looks like the backreference can be used in a search pattern. I started playing with it and noticed behavior that I can't explain. This is my file:
<paper-input label="Input label"> Some text </paper-input>
<paper-input label="Input label"> Some text </paper-inputa>
<aza> Some text </az>
<az> Some text </az>
<az> Some text </aza>
I wanted to match only the lines where the opening and closing tags match, i.e.:
<paper-input label="Input label"> Some text </paper-input>
<az> Some text </az>
And my test regex is:
%s,<\([^ >]\+\).*<\/\1>,,gn
But this matches lines 1, 3 and 4. Same thing with sed:
$ sed -ne 's,<\([^ >]\+\).*<\/\1>,\0,p' file
<paper-input label="Input label"> Some text </paper-input>
<aza> Some text </az>
<az> Some text </az>
This: <\([^ >]\+\) should be greedy, and when I match it without \1 at the end, all the groups are correct. But when I add \1, it seems that <\([^ >]\+\) becomes non-greedy and tries to force the match on the 3rd line. Can someone explain why it matches the 3rd line:
<aza> Some text </az>
This is also a regex101 demo
NOTE: This is not about the regex itself (there is probably another way to do it) but about the behavior of that regex.
Currently the reason why line 3 (<aza>) is showing up as a match is that the .* term in your regex can match across multiple lines. So line 3 matches because line 5 has the closing tag. To correct this, force the regex to find a matching closing tag on the same line only:
You need to add \> to indicate the end of a word. There may be other solutions with zero-width patterns, but they would complicate things. Also, your separator is , not /
Which gives:
To understand why your regex behaves the way it does you need to understand what a backtracking regex engine does.
The engine will greedily match and consume as many characters as it can. But if it doesn't find a match it goes back and tries to find a different match that still satisfies the pattern.
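The backtracking described above can be observed outside Vim. The following is a minimal sketch in Python, whose `re` module is also a backtracking engine; it assumes the Vim pattern `<\([^ >]\+\).*<\/\1>` translates to `<([^ >]+).*</\1>` in Python syntax, and the names `LINES`, `PATTERN` and `captured_names` are made up for this illustration. It reports what group 1 ended up holding for each line that matches.

```python
import re

# The five lines from the question, in order.
LINES = [
    '<paper-input label="Input label"> Some text </paper-input>',
    '<paper-input label="Input label"> Some text </paper-inputa>',
    '<aza> Some text </az>',
    '<az> Some text </az>',
    '<az> Some text </aza>',
]

# Assumed Python spelling of the Vim pattern <\([^ >]\+\).*<\/\1>
PATTERN = re.compile(r'<([^ >]+).*</\1>')


def captured_names(lines, pattern=PATTERN):
    """Return {line_number: text captured by group 1} for every matching line."""
    result = {}
    for number, line in enumerate(lines, start=1):
        match = pattern.search(line)
        if match:
            result[number] = match.group(1)
    return result


if __name__ == "__main__":
    for number, name in captured_names(LINES).items():
        print(number, repr(name))
```

On line 3 the group holds `az` rather than `aza`: the greedy capture was shortened by one character so that the rest of the pattern could succeed.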
For line three, <aza> Some text </az>, the regex engine first tries \1 = aza and checks whether .*</aza> matches the rest of the string. It doesn't, so it chooses something else for \1. The next time it chooses \1 = az and checks whether .*</az> matches the rest of the string, and it does. So the string matches.
(This is a simplified version. I skipped over the fact that .* can potentially do a lot of backtracking itself.)
Solving it is as easy as adding an anchor to the regex that stops the engine from searching for other values that could satisfy \1. In this case matching a space or > is sufficient.
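As a rough check of the anchoring idea, here is a small Python sketch that compares the original pattern with two anchored variants. The pattern translations are assumptions: `[ >]` after the group stands for "matching a space or >", and Python's `\b` is used as an approximate stand-in for Vim's `\>` (it matches at both word starts and word ends, so it is not an exact equivalent). The names `ORIGINAL`, `ANCHORED`, `WORD_BOUNDARY` and `matching_lines` are hypothetical.

```python
import re

# The five lines from the question, in order.
LINES = [
    '<paper-input label="Input label"> Some text </paper-input>',
    '<paper-input label="Input label"> Some text </paper-inputa>',
    '<aza> Some text </az>',
    '<az> Some text </az>',
    '<az> Some text </aza>',
]

# Original pattern: nothing stops group 1 from shrinking during backtracking.
ORIGINAL = re.compile(r'<([^ >]+).*</\1>')
# Anchored: the tag name must be followed directly by a space or '>'.
ANCHORED = re.compile(r'<([^ >]+)[ >].*</\1>')
# Word-boundary variant (approximation of Vim's \> using Python's \b).
WORD_BOUNDARY = re.compile(r'<([^ >]+)\b.*</\1>')


def matching_lines(pattern, lines):
    """Return the 1-based numbers of the lines in which the pattern is found."""
    return [n for n, line in enumerate(lines, start=1) if pattern.search(line)]


if __name__ == "__main__":
    print("original:     ", matching_lines(ORIGINAL, LINES))
    print("anchored:     ", matching_lines(ANCHORED, LINES))
    print("word boundary:", matching_lines(WORD_BOUNDARY, LINES))
```

With either anchor, shortening the capture from `aza` to `az` no longer helps, because `az` is followed by `a` rather than by a space, `>`, or a word boundary, so line 3 drops out.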