I'm working on compiling a table of cases for a legal book. I've converted it to HTML so I can use the tags for search and replace operations, and I'm currently working in Kate. The text refers to the names of cases and the citations for the cases are in the footnotes, e.g.
<i>Smith v Jones</i>127 ......... [other stuff including newline characters].......<br/>127 (1937) 173 ER 406;
I've been able to get lookahead working in Kate, using:
<i>.*</i>([0-9]{1,4}) .+<br/>\1 .*<br/>
...but I've run into greediness problems.
The text is a mess, so I really need to find matches step by step rather than relying on a batch process.
Is there a Linux (or Windows) text editor that supports both lookahead AND non-greedy operators, or am I going to have to try grep or sed?
I'm not familiar with Kate, but it seems to use QRegExp, which is incompatible with other Perl-like regex flavors in many important ways. For example, most flavors allow you to make individual quantifiers non-greedy by appending a question mark (e.g. `.*` => `.*?`), but in QRegExp you can only make them all greedy or all non-greedy. What's worse, it looks like Kate doesn't even let you do that (via a Non-Greedy checkbox, for example).
But it's best not to rely on non-greedy quantifiers all the time anyway. For one thing, they don't guarantee the shortest possible match, as many people say. You should get in the habit of being more specific about what should and should not be matched, when that's not too difficult. For example, if the section you want to match doesn't contain any tags other than the ones in your sample string, you can write `[^<]*` wherever you would otherwise write `.*`.
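To make those two points concrete, here is a small sketch in Python. Python's `re` module is only standing in for the editor's engine, and the sample strings and variable names are invented for illustration. The first part shows a non-greedy quantifier returning something other than the shortest possible match. The second shows one way the pattern from the question might look with `[^<]*` and `[^<]+` substituted for `.*` and `.+`.

```python
import re

# 1. Non-greedy is not "shortest match": the engine still commits to the
#    leftmost starting point, then extends as little as it can from there.
text = "<i>a<i>b</i>"
lazy = re.search(r"<i>.*?</i>", text)
print(lazy.group(0))        # <i>a<i>b</i>

# Being specific about what may appear between the tags gives the short match.
specific = re.search(r"<i>[^<]*</i>", text)
print(specific.group(0))    # <i>b</i>

# 2. The question's pattern with negated character classes (an assumed
#    adaptation, not necessarily the exact regex the answer had in mind).
sample = (
    "<i>Smith v Jones</i>127 some discussion\n"
    "spanning lines<br/>127 (1937) 173 ER 406;<br/>"
)
pattern = r"<i>[^<]*</i>([0-9]{1,4}) [^<]+<br/>\1 [^<]*<br/>"
m = re.search(pattern, sample)
print(m.group(1))           # 127
```

Note that `[^<]` matches newline characters without any special flag, so the footnote text can span lines.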
The advantage of using `[^<]*` instead of `.*` is that it will never try to match anything after the next `<`. `.*` will always grab the rest of the document at first, only to backtrack almost all the way to the starting point. The non-greedy version, `.*?`, will initially match only to the next `<`, but if the match attempt fails later on it will go ahead and consume the `<` and beyond, eventually consuming the whole document.
If there can be other tags, you can use
[^<]*(<(?!br/>)[^<]*)*
instead. It will consume any characters that are not `<`, or a `<` if it's not the beginning of a `<br/>` tag.
By the way, what you're calling a lookahead (I'm assuming you mean `\1`) is really a backreference. The `(?!br/>)` in my regex is an example of a lookahead, in this case a negative lookahead. The Kate/QRegExp docs claim that lookaheads are supported but non-capturing groups (e.g. `(?:...)`) aren't, which is why I used all capturing groups in that last regex.
If you have the option of switching to a different editor, I strongly recommend that you do so. My favorite is EditPad Pro; it has the best regex support I've ever seen in an editor.
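Whichever editor you end up with, the `[^<]*(<(?!br/>)[^<]*)*` pattern and the backreference can be tried out first in a scripting language. The following is a rough sketch using Python's `re` module. The sample text, the `not_br` name, and the way the pieces are joined into one pattern are assumptions for illustration, not something taken from Kate.

```python
import re

# Hypothetical footnoted text: the body contains an extra <b> tag, so a
# plain [^<]* cannot get from the footnote marker to the first <br/>.
sample = (
    "<i>Smith v Jones</i>127 was <b>later</b> doubted\n"
    "in other cases<br/>127 (1937) 173 ER 406;<br/>"
)

# A run of non-'<' characters, then zero or more '<' that do not start
# "<br/>" (negative lookahead), each followed by another non-'<' run.
not_br = r"[^<]*(<(?!br/>)[^<]*)*"

# The footnote number must stay the FIRST capturing group so that the
# backreference \1 still refers to it; not_br adds groups 2 and 3.
pattern = re.compile(
    r"<i>[^<]*</i>([0-9]{1,4}) " + not_br + r"<br/>\1 " + not_br + r"<br/>"
)

m = pattern.search(sample)
print(m.group(1))  # 127

# The simpler [^<]* version fails here because of the <b> tag.
simple = re.search(r"<i>[^<]*</i>([0-9]{1,4}) [^<]*<br/>\1 [^<]*<br/>", sample)
print(simple)      # None

# The backreference rejects a footnote whose number differs from the marker.
mismatch = pattern.search("<i>A v B</i>12 text<br/>13 cite;<br/>")
print(mismatch)    # None
```

Because the pattern only uses capturing groups and a negative lookahead, it stays within the features the answer describes as available in Kate.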