--Edit-- The current answers have some useful ideas, but I want something more complete that I can 100% understand and reuse; that's why I set a bounty. Also, ideas that work everywhere are better for me than non-standard syntax like \K.
This question is about how I can match a pattern except some situations s1 s2 s3. I give a specific example to show my meaning but prefer a general answer I can 100% understand so I can reuse it in other situations.
Example
I want to match five digits using \b\d{5}\b
but not in three situations s1 s2 s3:
s1: Not on a line that ends with a period like this sentence.
s2: Not anywhere inside parens.
s3: Not inside a block that starts with if( and ends with //endif
I know how to solve any one of s1 s2 s3 with a lookahead and lookbehind, especially in C# lookbehind or \K in PHP.
For instance
s1 (?m)(?!\d+.*?\.$)\d+
s3 with C# lookbehind (?<!if\(\D*(?=\d+.*?//endif))\b\d+\b
s3 with PHP \K (?:(?:if\(.*?//endif)\D*)*\K\d+
But the mix of conditions together makes my head explode. Even more bad news is that I may need to add other conditions s4 s5 at another time.
The good news is, I don't care if I process the files using most common languages like PHP, C#, Python or my neighbor's washing machine. :) I'm pretty much a beginner in Python & Java but interested to learn if it has a solution.
So I came here to see if someone can think of a flexible recipe.
Hints are okay: you don't need to give me full code. :)
Thank you.
Same as @zx81's (*SKIP)(*F), but using a negative lookahead assertion. DEMO
In Python, I would easily do it like this:
Output:
Your requirement that it's not inside parens is impossible to satisfy for all cases. Namely, if you can somehow find a ( to the left and a ) to the right, it doesn't always mean you are inside parens. E.g.
(....) + 55555 + (.....) - not inside parens, yet there are ( and ) to the left and right.
Now you might think yourself clever and look for ( to the left only if you don't encounter ) before, and vice versa to the right. This won't work for this case:
((.....) + 55555 + (.....)) - inside parens even though there are closing ) and ( to the left and to the right.
It is impossible to find out if you are inside parens using regex, as regex can't count how many parens have been opened and how many closed.
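To make the counting point concrete, here is a minimal Python sketch that tracks nesting depth with a plain counter, which is the bookkeeping the argument above says a regex cannot do. The function name paren_depth_at is made up for illustration, and the sketch assumes parens are never escaped or quoted.

```python
def paren_depth_at(text, pos):
    """Return how many unclosed '(' precede index pos in text."""
    depth = 0
    for ch in text[:pos]:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1  # ignore stray closers instead of going negative
    return depth


flat = "(....) + 55555 + (.....)"
nested = "((.....) + 55555 + (.....))"

print(paren_depth_at(flat, flat.index("55555")))      # 0: not inside parens
print(paren_depth_at(nested, nested.index("55555")))  # 1: inside the outer pair
```

A depth greater than zero at the position of a candidate match means the match sits inside parens, however deeply they are nested.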
Consider this easier task: using regex, find out if all (possibly nested) parens in a string are closed, that is, for every ( you need to find a matching ). You will find that this is impossible to solve, and if you can't solve that with regex then you can't figure out whether a word is inside parens in all cases, since you can't tell at a given position in the string whether every preceding ( has a corresponding ).
Hans, I'll take the bait and flesh out my earlier answer. You said you want "something more complete" so I hope you won't mind the long answer; I'm just trying to please. Let's start with some background.
First off, this is an excellent question. There are often questions about matching certain patterns except in certain contexts (for instance, within a code block or inside parentheses). These questions often give rise to fairly awkward solutions. So your question about multiple contexts is a special challenge.
Surprise
Surprisingly, there is at least one efficient solution that is general, easy to implement and a pleasure to maintain. It works with all regex flavors that allow you to inspect capture groups in your code. And it happens to answer a number of common questions that may at first sound different from yours: "match everything except Donuts", "replace all but...", "match all words except those on my mom's black list", "ignore tags", "match temperature unless italicized"...
Sadly, the technique is not well known: I estimate that in twenty SO questions that could use it, only one has one answer that mentions it—which means maybe one in fifty or sixty answers. See my exchange with Kobi in the comments. The technique is described in some depth in this article which calls it (optimistically) the "best regex trick ever". Without going into as much detail, I'll try to give you a firm grasp of how the technique works. For more detail and code samples in various languages I encourage you to consult that resource.
A Better-Known Variation
There is a variation using syntax specific to Perl and PHP that accomplishes the same. You'll see it on SO in the hands of regex masters such as CasimiretHippolyte and HamZa. I'll tell you more about this below, but my focus here is on the general solution that works with all regex flavors (as long as you can inspect capture groups in your code).
Key Fact
In fact, the trick is to match the various contexts we don't want (chaining these contexts using the | OR / alternation) so as to "neutralize them". After matching all the unwanted contexts, the final part of the alternation matches what we do want and captures it to Group 1.
The general recipe is Not_this_context|(WhatYouWant)
This will match Not_this_context, but in a sense that match goes into a garbage bin, because we won't look at the overall matches: we only look at Group 1 captures.
In your case, with your digits and your three contexts to ignore, we can do:
Note that because we actually match s1, s2 and s3 instead of trying to avoid them with lookarounds, the individual expressions for s1, s2 and s3 can remain clear as day. (They are the subexpressions on each side of a |.)
The whole expression can be written like this:
See this demo (but focus on the capture groups in the lower right pane.)
If you mentally try to split this regex at each | delimiter, it is actually only a series of four very simple expressions.
For flavors that support free-spacing, this reads particularly well.
This is exceptionally easy to read and maintain.
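As an illustration, here is how the recipe might look in Python using free-spacing mode (re.VERBOSE). This is a sketch, not a definitive implementation: the three sub-patterns for s1, s2 and s3 are my own assumptions about the contexts in the question, s2 only handles non-nested parens, and the names PATTERN, wanted_numbers and sample are made up.

```python
import re

# Hypothetical sub-patterns for the question's three contexts.
PATTERN = re.compile(
    r"""
      ^.*\.$                  # s1: a whole line that ends with a period
    | \([^()]*\)              # s2: a non-nested parenthesized chunk
    | if\([\s\S]*?//endif     # s3: an if( ... //endif block, may span lines
    | \b(\d{5})\b             # what we want: five digits, captured in Group 1
    """,
    re.MULTILINE | re.VERBOSE,
)


def wanted_numbers(text):
    # Every alternative can match, but only the last one sets Group 1.
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1)]


sample = """\
12345 is fine here
but 67890 is on a line ending with a period.
skip (11111) but keep 22222
if( 33333 )
  44444
//endif
55555
"""

print(wanted_numbers(sample))  # ['12345', '22222', '55555']
```

The overall matches for the first three alternatives are simply discarded; the list comprehension keeps only the matches where Group 1 is set.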
Extending the regex
When you want to ignore more situations s4 and s5, you add them in more alternations on the left:
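One way to keep this extensible, sketched in Python, is to hold the contexts in a list and join them with | in front of the capturing group. The helper build_pattern and the extra s4 context (HTML-style comments) are hypothetical, and the sketch assumes none of the context patterns contains a capturing group of its own, since that would shift the group numbering.

```python
import re


def build_pattern(contexts, wanted):
    """Join the contexts to ignore, then put the wanted pattern in Group 1."""
    garbage = "|".join("(?:%s)" % c for c in contexts)
    return re.compile("%s|(%s)" % (garbage, wanted), re.MULTILINE)


contexts = [
    r"^.*\.$",               # s1
    r"\([^()]*\)",           # s2
    r"if\([\s\S]*?//endif",  # s3
]
# A made-up s4: ignore HTML-style comments as well.
contexts.append(r"<!--[\s\S]*?-->")

pattern = build_pattern(contexts, r"\b\d{5}\b")
text = "11111 <!-- 22222 --> (33333) 44444"
print([m.group(1) for m in pattern.finditer(text) if m.group(1)])
# ['11111', '44444']
```

Adding s5 later is one more append; the code that inspects Group 1 does not change.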
How does this work?
The contexts you don't want are added to a list of alternations on the left: they will match, but these overall matches are never examined, so matching them is a way to put them in a "garbage bin".
The content you do want, however, is captured to Group 1. You then have to check programmatically that Group 1 is set and not empty. This is a trivial programming task (and we'll later talk about how it's done), especially considering that it leaves you with a simple regex that you can understand at a glance and revise or extend as required.
I'm not always a fan of visualizations, but this one does a good job of showing how simple the method is. Each "line" corresponds to a potential match, but only the bottom line is captured into Group 1.
Debuggex Demo
Perl/PCRE Variation
In contrast to the general solution above, there exists a variation for Perl and PCRE that is often seen on SO, at least in the hands of regex Gods such as @CasimiretHippolyte and @HamZa. It is:
In your case:
This variation is a bit easier to use because the content matched in contexts s1, s2 and s3 is simply skipped, so you don't need to inspect Group 1 captures (notice the parentheses are gone). The matches only contain whatYouWant.
Note that (*F), (*FAIL) and (?!) are all the same thing. If you wanted to be more obscure, you could use (*SKIP)(?!).
demo for this version
Applications
Here are some common problems that this technique can often easily solve. You'll notice that the word choice can make some of these problems sound different while in fact they are virtually identical.
<a stuff...>...</a>?
<i> tag or a javascript snippet (more conditions)?
How to Program the Group 1 Captures
You didn't ask for code, but, for completeness... The code to inspect Group 1 will obviously depend on your language of choice. At any rate it shouldn't add more than a couple of lines to the code you would use to inspect matches.
If in doubt, I recommend you look at the code samples section of the article mentioned earlier, which presents code for quite a few languages.
Alternatives
Depending on the complexity of the question, and on the regex engine used, there are several alternatives. Here are the two that can apply to most situations, including multiple conditions. In my view, neither is nearly as attractive as the s1|s2|s3|(whatYouWant) recipe, if only because clarity always wins out.
1. Replace then Match.
A good solution that sounds hacky but works well in many environments is to work in two steps. A first regex neutralizes the context you want to ignore by replacing potentially conflicting strings. If you only want to match, then you can replace with an empty string, then run your match in the second step. If you want to replace, you can first replace the strings to be ignored with something distinctive, for instance surrounding your digits with a fixed-width chain of @@@. After this replacement, you are free to replace what you really wanted, then you'll have to revert your distinctive @@@ strings.
2. Lookarounds.
Your original post showed that you understand how to exclude a single condition using lookarounds. You said that C# is great for this, and you are right, but it is not the only option. The .NET regex flavors found in C#, VB.NET and Visual C++ for example, as well as the still-experimental regex module to replace re in Python, are the only two engines I know that support infinite-width lookbehind. With these tools, one condition in one lookbehind can take care of looking not only behind but also at the match and beyond the match, avoiding the need to coordinate with a lookahead. More conditions? More lookarounds.
Recycling the regex you had for s3 in C#, the whole pattern would look like this.
But by now you know I'm not recommending this, right?
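For the first alternative (replace, then match), a minimal Python sketch might look like this. The context patterns are the same assumptions as for the main recipe, and the names are made up. One deliberate departure from the description above: the contexts are replaced with a space rather than an empty string, so that digits on either side of a removed chunk cannot fuse into a false five-digit match.

```python
import re

# Assumed patterns for s1, s2 and s3 (illustrative only).
CONTEXTS = re.compile(r"^.*\.$|\([^()]*\)|if\([\s\S]*?//endif", re.MULTILINE)
WANTED = re.compile(r"\b\d{5}\b")


def match_after_neutralizing(text):
    # Step 1: blank out the contexts to ignore.
    cleaned = CONTEXTS.sub(" ", text)
    # Step 2: run the simple match on what is left.
    return WANTED.findall(cleaned)


print(match_after_neutralizing("keep 12345 (skip 67890) keep 24680"))
# ['12345', '24680']
```

The cost is that match positions now refer to the cleaned string, not the original, which is one reason the single-pass recipe tends to be easier to live with.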
Deletions
@HamZa and @Jerry have suggested I mention an additional trick for cases when you seek to just delete WhatYouWant. You remember that the recipe to match WhatYouWant (capturing it into Group 1) was s1|s2|s3|(WhatYouWant), right? To delete all instances of WhatYouWant, you change the regex to (s1|s2|s3)|WhatYouWant.
For the replacement string, you use $1. What happens here is that for each instance of s1|s2|s3 that is matched, the replacement $1 replaces that instance with itself (referenced by $1). On the other hand, when WhatYouWant is matched, it is replaced by an empty group and nothing else, and therefore deleted. See this demo, thank you @HamZa and @Jerry for suggesting this wonderful addition.
Replacements
This brings us to replacements, on which I'll touch briefly.
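In flavors without special verbs, both the deletion trick and ordinary replacements can be done with a callback that looks at Group 1. Here is a hedged Python sketch; the context patterns are the same assumptions as before, and delete_wanted and replace_wanted are made-up names.

```python
import re

# Assumed patterns for s1|s2|s3 and for what we want (illustrative only).
CONTEXTS = r"^.*\.$|\([^()]*\)|if\([\s\S]*?//endif"
WANTED = r"\b\d{5}\b"

# Deletion: the contexts go in Group 1 and are put back; the rest is dropped.
DELETE = re.compile("(%s)|%s" % (CONTEXTS, WANTED), re.MULTILINE)


def delete_wanted(text):
    return DELETE.sub(lambda m: m.group(1) or "", text)


# Replacement: the wanted text goes in Group 1; other matches are kept as-is.
REPLACE = re.compile("%s|(%s)" % (CONTEXTS, WANTED), re.MULTILINE)


def replace_wanted(text, replacement):
    return REPLACE.sub(
        lambda m: replacement if m.group(1) else m.group(0), text
    )


text = "a 12345 (b 67890) c"
print(repr(delete_wanted(text)))            # 'a  (b 67890) c'
print(repr(replace_wanted(text, "#####")))  # 'a ##### (b 67890) c'
```

Using a function instead of a $1-style template sidesteps differences in how engines treat an unset group in the replacement string.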
In Perl or PCRE, you can use the (*SKIP)(*F) variation mentioned above to match exactly what you want, and do a straight replacement.
Have fun!
No, wait, there's more!
Ah, nah, I'll save that for my memoirs in twenty volumes, to be released next Spring.
Hans, if you don't mind, I used your neighbor's washing machine, called Perl :)
Edited: below is some pseudocode:
Given the file input.txt:
And the script validator.pl:
Execution:
Not sure if this would help you or not, but I am providing a solution considering the following assumptions -
However I considered also the following -
if( blocks.
OK, here is the solution -
I used C# and with it MEF (Managed Extensibility Framework) to implement the configurable parsers. The idea is to use a single parser to parse and a list of configurable validator classes to validate each line and return true or false based on the validation. You can then add or remove any validator at any time, or add new ones if you like. So far I have implemented validators for the S1, S2 and S3 you mentioned; check the classes at point 3. You will have to add classes for s4 and s5 if you need them in the future.
First, Create the Interfaces -
Then comes the file reader and checker -
Then comes the implementation of individual checkers, the class names are self explanatory, so I don't think they need more descriptions.
The program -
For testing I took @Tiago's sample file as Test.txt, which had the following lines -
Gives the output -
I don't know if this will help you or not, but I did have a fun time playing with it... :)
The best part of it is that, to add a new condition, all you have to do is provide an implementation of IPatternMatcher; it will automatically get called and thus will validate.
Do three different matches and handle the combination of the three situations using in-program conditional logic. You don't need to handle everything in one giant regex.
EDIT: let me expand a bit because the question just became more interesting :-)
The general idea you are trying to capture here is to match against a certain regex pattern, but not when there are certain other (could be any number) patterns present in the test string. Fortunately, you can take advantage of your programming language: keep the regexes simple and just use a compound conditional. A best practice would be to capture this idea in a reusable component, so let's create a class and a method that implement it:
So above, we set up the search string (the five digits), multiple exception strings (your s1, s2 and s3), and then try to match against several test strings. The printed results should be as shown in the comments next to each test string.
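As a rough illustration of such a reusable component, here is a minimal Python sketch. The class name MatchUnless, its method and the three exception patterns are all my own assumptions. Note that it judges each test string as a whole, so it is coarser than the capture-group recipe: a string with one number inside parens and another outside is rejected entirely.

```python
import re


class MatchUnless:
    """Match a pattern only when none of the exception patterns match."""

    def __init__(self, search, exceptions, flags=0):
        self.search = re.compile(search, flags)
        self.exceptions = [re.compile(e, flags) for e in exceptions]

    def matches(self, text):
        # A compound conditional: any exception present means no match.
        if any(e.search(text) for e in self.exceptions):
            return False
        return self.search.search(text) is not None


checker = MatchUnless(
    r"\b\d{5}\b",
    [
        r"\.$",                        # s1: the string ends with a period
        r"\([^()]*\b\d{5}\b[^()]*\)",  # s2: five digits inside parens
        r"if\(.*\b\d{5}\b.*//endif",   # s3: five digits in if( ... //endif
    ],
)

for s in ["12345", "12345.", "(12345)", "if(12345 //endif", "x 12345 (y)"]:
    print(s, checker.matches(s))
# 12345 True
# 12345. False
# (12345) False
# if(12345 //endif False
# x 12345 (y) True
```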