In my home directory I have a folder drupal-6.14 that contains the Drupal platform.
From this directory I use the following command:
find drupal-6.14 -type f -iname '*' | grep -P 'drupal-6.14/(?!sites(?!/all|/default)).*' | xargs tar -czf drupal-6.14.tar.gz
This command creates a gzipped tarball of the folder drupal-6.14, excluding all subfolders of drupal-6.14/sites/ except sites/all and sites/default, which it includes.
My question is on the regular expression:
grep -P 'drupal-6.14/(?!sites(?!/all|/default)).*'
The expression works to exclude all the folders I want excluded, but I don't quite understand why.
A common task with regular expressions is to match all strings except those that contain some subpattern x, or in other words, to negate a subpattern.
I think I understand that the general strategy for solving these problems is to use negative lookaheads, but I've never understood to a satisfactory level how positive and negative look(ahead/behind)s work.
Over the years, I've read many websites on them: the PHP and Python regex manuals, other pages like http://www.regular-expressions.info/lookaround.html, and so forth, but I've never really had a solid understanding of them.
Could someone explain how this works, and perhaps provide some similar examples?
-- Update One:
Regarding Andomar's response: can a double negative lookahead be expressed more succinctly as a single positive lookahead?
That is, is:
'drupal-6.14/(?!sites(?!/all|/default)).*'
equivalent to:
'drupal-6.14/(?=sites(?:/all|/default)).*'
???
-- Update Two:
As per @andomar and @alan moore: you can't substitute a positive lookahead for a double negative lookahead.
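To check this concretely, here is a small sketch comparing the two patterns from Update One. It assumes Python's re module as a stand-in for grep -P, and the sample paths are hypothetical, not taken from a real Drupal tree.

```python
import re

# The two patterns from Update One, copied verbatim.
double_negative = re.compile(r"drupal-6.14/(?!sites(?!/all|/default)).*")
single_positive = re.compile(r"drupal-6.14/(?=sites(?:/all|/default)).*")

# Hypothetical sample paths (assumption for illustration only).
paths = [
    "drupal-6.14/index.php",
    "drupal-6.14/sites/all/modules/README.txt",
    "drupal-6.14/sites/example.com/settings.php",
]


def kept(pattern):
    """Return the paths on which the pattern finds a match, like a grep filter."""
    return [p for p in paths if pattern.search(p)]


kept_double_negative = kept(double_negative)
kept_single_positive = kept(single_positive)

print(kept_double_negative)
print(kept_single_positive)
```

The positive form drops drupal-6.14/index.php, which the double negative keeps, because the positive lookahead requires "sites" to be present while the double negative only restricts what may follow "sites" when it is present.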
A negative lookahead says: at this position, the following regex must not match.
Let's take a simplified example:
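One way to see this is the following sketch in Python's re module. The pattern a(?!b(?!c)) and the sample inputs are illustrative assumptions, chosen to fit the description that follows.

```python
import re

# Hypothetical simplified pattern: an "a" that is NOT followed by
# a "b" which is itself NOT followed by a "c".
pattern = re.compile(r"a(?!b(?!c))")

samples = ["a", "ac", "ab", "abe", "abc"]


def matched_text(text):
    """Return the text consumed by the match, or None if there is no match."""
    m = pattern.search(text)
    return m.group(0) if m else None


results = {s: matched_text(s) for s in samples}

for s, matched in results.items():
    print(f"{s!r:6} -> {matched!r}")
# 'a'    -> 'a'    (no "b" follows, so the outer lookahead succeeds)
# 'ac'   -> 'a'    (same reason)
# 'ab'   -> None   ("b" is not followed by "c", so the outer lookahead fails)
# 'abe'  -> None   (same reason)
# 'abc'  -> 'a'    ("b" is followed by "c", so the inner lookahead fails
#                   and the outer one succeeds)
```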
The last example is a double negation: it allows a "b" followed by "c". The nested negative lookahead becomes a positive lookahead: the "c" should be present.
In each example, only the "a" is matched. The lookahead is only a condition, and does not add to the matched text.
Lookarounds can be nested.
So this regex matches "drupal-6.14/" that is not followed by "sites" that is not followed by "/all" or "/default".
Confusing? Put differently, it matches "drupal-6.14/" that is not followed by "sites", unless that "sites" is in turn followed by "/all" or "/default".
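The "not followed by ... unless ..." reading can be restated as ordinary string checks. The sketch below assumes Python's re module in place of grep -P; keep_by_rule and the sample paths are hypothetical names and data introduced only for illustration.

```python
import re

PREFIX = "drupal-6.14/"
# The regex from the question, copied verbatim.
pattern = re.compile(r"drupal-6.14/(?!sites(?!/all|/default)).*")


def keep_by_rule(path):
    """Plain-Python restatement of the nested lookahead.

    Reject a path only when "sites" follows the prefix and is not in turn
    followed by "/all" or "/default". Assumes the path starts with PREFIX
    and contains it only once, as the output of `find drupal-6.14` would.
    """
    if not path.startswith(PREFIX):
        return False
    rest = path[len(PREFIX):]
    if not rest.startswith("sites"):
        return True  # outer lookahead has nothing to object to
    after_sites = rest[len("sites"):]
    return after_sites.startswith("/all") or after_sites.startswith("/default")


# Hypothetical sample paths (assumption for illustration only).
samples = [
    "drupal-6.14/index.php",
    "drupal-6.14/sites/all/modules/a.module",
    "drupal-6.14/sites/default/settings.php",
    "drupal-6.14/sites/example.com/settings.php",
    "drupal-6.14/sitesfoo/x",
]

by_regex = [p for p in samples if pattern.search(p)]
by_rule = [p for p in samples if keep_by_rule(p)]
print(by_regex == by_rule)
```

On these samples both filters keep the first three paths and drop the last two. Note that a sibling directory such as drupal-6.14/sitesfoo is dropped as well, since the lookahead only checks that the text after the slash begins with "sites".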
If you revise your regular expression so that the outer lookahead is positive:
drupal-6.14/(?=sites(?!/all|/default)).*
...then it will match all inputs that contain "drupal-6.14/" followed by "sites" followed by anything other than "/all" or "/default", for example drupal-6.14/sites/foo.
Changing "?=" to "?!" to match your original regex simply negates those matches. This means that "drupal-6.14/" now cannot be followed by "sites" followed by anything other than "/all" or "/default", so inputs such as drupal-6.14/sites/all and drupal-6.14/sites/default will satisfy the regex.
But what may not be obvious from some of the other answers (and possibly your question) is that your regex will also permit inputs where "drupal-6.14/" is followed by anything other than "sites", for example drupal-6.14/foo.
Conclusion: your regex basically says to include all subdirectories of drupal-6.14 except those subdirectories of sites whose name begins with anything other than all or default.
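The negation described above can be checked side by side. This is a sketch assuming Python's re module rather than grep -P, with hypothetical sample inputs; the only difference between the two patterns is "?=" versus "?!" in the outer lookahead.

```python
import re

outer_positive = re.compile(r"drupal-6.14/(?=sites(?!/all|/default)).*")
outer_negative = re.compile(r"drupal-6.14/(?!sites(?!/all|/default)).*")  # original regex

# Hypothetical sample inputs (assumption for illustration only).
samples = [
    "drupal-6.14/sites/foo",
    "drupal-6.14/sitesall",
    "drupal-6.14/sites/all",
    "drupal-6.14/sites/default",
    "drupal-6.14/foo",
]

# Map each input to (matches outer-positive form, matches original form).
table = {
    s: (bool(outer_positive.search(s)), bool(outer_negative.search(s)))
    for s in samples
}

for s, (pos, neg) in table.items():
    print(f"{s:28} positive={pos!s:5} negative={neg}")
```

Every sample is matched by exactly one of the two patterns. drupal-6.14/foo lands on the original regex's side because the outer negative lookahead succeeds whenever "sites" is absent.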