Here's something I'm trying to do with regular expressions, and I can't figure out how. I have a big file, and strings abc, 123 and xyz that appear multiple times throughout the file.
I want a regular expression to match a substring of the big file that begins with abc, contains 123 somewhere in the middle, ends with xyz, and there are no other instances of abc or xyz in the substring besides the start and the end.
Is this possible with regular expressions?
You need a tempered greedy token:
abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz
To make sure it matches across lines, use the re.DOTALL flag when compiling the regex.
Note that to achieve better performance with such a heavy pattern, you should consider unrolling it. This can be done with negated character classes and negative lookaheads.
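For illustration, here is one way the unrolling could look. This is only a sketch: the unrolled pattern below is my own reconstruction (not something taken from the answer), and the names TEMPERED, UNROLLED and sample are made up for the example. Each tempered token is replaced by a negated character class that consumes the "safe" characters, plus alternatives that allow a, x or 1 only when they do not start abc, xyz or 123.

```python
import re

# Tempered greedy token version, as described in the pattern details.
TEMPERED = re.compile(
    r"abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz", re.DOTALL
)

# Unrolled version (an assumed reconstruction, not the answer's own pattern).
# Negated classes match newlines too, so re.DOTALL is not needed here.
UNROLLED = re.compile(
    r"abc"
    r"[^ax1]*(?:(?:a(?!bc)|x(?!yz)|1(?!23))[^ax1]*)*"  # no abc, xyz or 123
    r"123"
    r"[^ax]*(?:(?:a(?!bc)|x(?!yz))[^ax]*)*"            # no abc or xyz
    r"xyz"
)

# Hypothetical sample text with several candidate spans.
sample = "abc 1 abc a 123 x 1 xyz abc no digits xyz abc 123 123 xyz"

tempered_hits = TEMPERED.findall(sample)
unrolled_hits = UNROLLED.findall(sample)
print(tempered_hits)
print(unrolled_hits)
```

Both patterns should return the same matches. Whether the unrolled one is actually faster depends on the input, so it is worth timing on your own data.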
Pattern details:
abc - match abc
(?:(?!abc|xyz|123).)* - match any character that is not the starting point for an abc, xyz or 123 character sequence
123 - a literal string 123
(?:(?!abc|xyz).)* - any character that is not the starting point for an abc or xyz character sequence
xyz - a trailing substring xyz
See the diagram below (if re.S is used, . will mean AnyChar):
See the Python demo:
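A minimal Python sketch of the approach described above, assuming the pattern assembled from the pattern details. The sample text and variable names are invented for the example.

```python
import re

# Pattern assembled from the pattern details; re.DOTALL (alias re.S)
# lets . match newlines so a match can span several lines.
pattern = re.compile(
    r"abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz", re.DOTALL
)

# Hypothetical input: the first abc...xyz span contains 123, the second does not.
text = "abc\nfoo 123\nbar xyz and abc nothing xyz"

matches = pattern.findall(text)
print(matches)

# Same pattern without the flag: . stops at newlines, so nothing matches here.
no_flag_matches = re.findall(pattern.pattern, text)
print(no_flag_matches)
```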
The comment by hvd is quite appropriate, and this just provides an example. In SQL, for instance, I think it would be clearer to do:
I imagine something quite similar is simple to do in other environments.
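As a rough idea of what "quite similar" might look like in Python, here is a sketch that checks a candidate string with several simple tests instead of one regex. The function name is_clean_span is hypothetical, and this is my guess at the spirit of the approach rather than a translation of the SQL.

```python
def is_clean_span(s: str) -> bool:
    """Check a candidate substring with plain string tests (a sketch)."""
    # Must begin with abc and end with xyz.
    if not (s.startswith("abc") and s.endswith("xyz")):
        return False
    # No second abc after the leading one, no xyz before the trailing one.
    if "abc" in s[1:] or "xyz" in s[:-1]:
        return False
    # 123 must appear somewhere between the two delimiters.
    return "123" in s[3:-3]


print(is_clean_span("abc 123 xyz"))
print(is_clean_span("abc abc 123 xyz"))
```

Note this only validates a candidate you already have; it does not search a large file for such spans the way the regex does.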
You could use lookaround.
(I've not tested it.)
Using PCRE a solution would be:
This uses the m flag. If you want to check only from the start to the end of a line, add ^ and $ at the beginning and end respectively.
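To show the effect of the ^ and $ anchors with the m flag, here is a sketch in Python's re module rather than PCRE. The pattern is an assumption: it reuses the tempered greedy token pattern from the earlier answer, since the PCRE pattern itself is not shown here. The sample lines are made up.

```python
import re

# Assumed pattern: tempered greedy token wrapped in line anchors.
# re.MULTILINE makes ^ and $ match at the start and end of each line.
line_pattern = re.compile(
    r"^abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz$", re.MULTILINE
)

# Hypothetical input, one candidate per line.
lines = "\n".join([
    "abc 123 xyz",       # whole line is a valid span
    "foo abc 123 xyz",   # span does not start at the line start
    "abc 123 xyz bar",   # span does not end at the line end
    "abc 1 123 2 xyz",   # whole line is a valid span
])

whole_line_matches = line_pattern.findall(lines)
print(whole_line_matches)
```

Without re.DOTALL the dot does not cross newlines, so each match stays within a single line.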