Assuming you have the following text:
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam Lorem! nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At Lorem, vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
And you want to match every occurrence of the keyword "lorem", with the constraint that it must not be a substring of another word. To do that, I check that the match is preceded and followed by whitespace or by the start/end of the string:
/(^|\s)(lorem)(?=\s|$)/gmi
That works fine. However, I want to extend this so that it also finds matches that are followed by any special character such as , or % (not limited to those), and not just by whitespace. The problem is that there seems to be no character class that matches only special characters, and I can't use \w or \W because they treat letters with diacritics as non-word characters (even though they are word characters).
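To illustrate the problem, here is a quick check (a sketch that is not part of the linked demo; the sample character é is just an assumed example). In JavaScript, \w covers only A-Z, a-z, 0-9 and _, so a letter with a diacritic counts as a non-word character and \b reports a boundary in front of it:

```javascript
// "\u00E9" is the letter e with an acute accent (assumed sample character).
const eAcute = "\u00E9";

// \w only covers [A-Za-z0-9_], so the accented letter is not a word character...
const isWordChar = /\w/.test(eAcute);

// ...and \W therefore matches it.
const isNonWordChar = /\W/.test(eAcute);

// Consequence: \b sees a boundary between "m" and the accented letter,
// so "lorem" is found inside the longer word.
const falsePositive = /\blorem\b/i.test("lorem" + eAcute);

console.log(isWordChar, isNonWordChar, falsePositive); // false true true
```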
So how can this be achieved? Is there a way to specify a range of non-word characters that does not include letters with diacritics?
Note that I'm not able to use a RegExp extension plugin to get Unicode support in the search.
Example of my situation: Demo.
You may use XRegExp as the source of a custom word boundary:
Here, (?:^|[^_0-9" + pL + "]) and (?![_0-9" + pL + "]) act as word boundaries. The first, a non-capturing group, matches either the start of the string or a character other than _, a digit, or a Unicode letter. The negative lookahead makes sure that no _, digit, or Unicode letter follows the word.
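A minimal sketch of how these two pieces might be assembled around the keyword. Several things here are assumptions for illustration: the pL string below is a heavily abbreviated stand-in that covers only Latin letters (the full Unicode letter range from the XRegExp source is far longer), and the helper name findWholeWord, the keyword escaping, and the sample string are hypothetical.

```javascript
// ASSUMPTION: abbreviated letter range (Basic Latin through Latin Extended-B).
// Replace it with the full Unicode letter range copied from the XRegExp source.
const pL = "A-Za-z\\u00AA\\u00B5\\u00BA\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u024F";

// Hypothetical helper: returns every whole-word occurrence of `word` in `text`.
function findWholeWord(text, word) {
  // Escape regex metacharacters so the keyword is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const regex = new RegExp(
    "(?:^|[^_0-9" + pL + "])" +   // leading boundary: start of string or a non-word character
    "(" + escaped + ")" +          // the keyword itself, captured in group 1
    "(?![_0-9" + pL + "])",        // trailing boundary: no _, digit or letter may follow
    "gi"
  );
  const matches = [];
  let m;
  while ((m = regex.exec(text)) !== null) {
    matches.push(m[1]);
  }
  return matches;
}

// "Lor\u00E9m" and "lorem\u00E9" contain an accented e and must not match.
const sample = "Lorem! ipsum Lorem, Lor\u00E9m lorem\u00E9 xlorem lorem";
const found = findWholeWord(sample, "lorem");
console.log(found); // [ 'Lorem', 'Lorem', 'lorem' ]
```

Note that the leading group consumes the character before the keyword, which is why the keyword is read from capture group 1 rather than from the whole match.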