How to make a portable regex?

Posted 2019-06-26 13:38

Question:

Which features of regular expressions are standard, and which are idiosyncratic?
What should I do, and not do, if I want to use the same regex in different contexts, languages, and platforms?

Answer 1:

There is no standard, but if maximum portability is your goal you should stick to the features supported by JavaScript regexes. All of the other major flavors support everything JS does, with only minor variations here and there. For example, some only support the POSIX character-class notation ([:alpha:]), while others use the Unicode syntax (\p{Alpha}).
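To make the character-class point concrete, here is a minimal Python sketch. It assumes Python's built-in `re` module as the test flavor, which rejects the `\p{...}` property syntax. The helper name `compiles` is hypothetical and exists only for this illustration. The idea is that an explicit range such as `[A-Za-z]` is the lowest common denominator when a pattern has to travel between flavors.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if Python's built-in re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# The Unicode property notation is not accepted by Python's re module.
unicode_ok = compiles(r"\p{Alpha}")

# An explicit ASCII range avoids both the POSIX and the Unicode notation.
portable_letters = re.compile(r"[A-Za-z]+")
word = portable_letters.match("regex101").group(0)

print(unicode_ok)
print(word)
```

The explicit range only covers ASCII letters, so it is a trade-off: the pattern is portable, but it no longer matches accented or non-Latin letters.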

Probably the most troublesome variations are those that affect the dot (.) and the anchors (^ and $). For example, JavaScript had no DOTALL (or "single-line") mode until ES2018 added the s flag, so to match anything including a newline in older engines you have to use a hack like [\s\S]. Meanwhile, Ruby has a DOTALL mode but calls it multiline mode; what everyone else calls "multiline" (^ and $ as line anchors) is how Ruby always works.
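The difference between the dot, the [\s\S] workaround, and the mode flags can be sketched in Python. This is only an illustration in one flavor; the sample string and variable names are made up for the example.

```python
import re

text = "first\nsecond"

# By default the dot stops at a newline, so this does not match.
default_dot = re.match(r"first.second", text)

# [\s\S] matches any character, newline included, with no flag needed.
portable_any = re.match(r"first[\s\S]second", text)

# The flavor-specific alternative: Python calls this mode DOTALL.
dotall = re.match(r"first.second", text, re.DOTALL)

# In Python, ^ only becomes a line anchor with the MULTILINE flag.
without_flag = re.findall(r"^\w+", text)
with_flag = re.findall(r"^\w+", text, re.MULTILINE)

print(default_dot, portable_any is not None, dotall is not None)
print(without_flag, with_flag)
```

The [\s\S] form has the advantage that its meaning lives in the pattern itself rather than in a flag whose name and behavior change from one flavor to the next.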

Be aware, too, of exactly what the dot doesn't match (in the default mode). Traditionally that was just the linefeed (\n), but more and more flavors are adopting (or at least approximating) the Unicode guidelines concerning line separators. For example, in Java the dot doesn't match any of [\r\n\u0085\u2028\u2029], while ^ and $ treat \r\n as a single separator and won't match between the two characters.
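One way to sidestep these differences is to spell out the excluded characters instead of relying on the dot. The Python sketch below assumes the same five separators listed for Java above; the constant name `NOT_LINE_BREAK` is hypothetical. Python's own dot excludes only \n, which makes the difference easy to see. Note that the \uXXXX escape syntax is itself flavor-dependent, so the class may need to be written differently elsewhere.

```python
import re

# Explicit "anything but a line separator" class, mirroring the Java list.
NOT_LINE_BREAK = r"[^\r\n\u0085\u2028\u2029]"

# Python's dot excludes only \n, so it happily matches a carriage return.
dot_matches_cr = re.fullmatch(r".", "\r") is not None

# The explicit class rejects it, whatever the flavor's dot would do.
explicit_matches_cr = re.fullmatch(NOT_LINE_BREAK, "\r") is not None

# On Windows-style line endings the dot drags the \r along.
dot_first_line = re.match(r".*", "abc\r\ndef").group(0)
explicit_first_line = re.match(NOT_LINE_BREAK + "*", "abc\r\ndef").group(0)

print(dot_matches_cr, explicit_matches_cr)
print(repr(dot_first_line), repr(explicit_first_line))
```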

Note that I'm only talking about Perl-derived flavors, like Python, Ruby, PHP, JavaScript, etc. It wouldn't make sense to include GNU- or POSIX-based flavors like grep, awk, and MySQL; they tend to have fewer features, but that's not what you would choose them for anyway.

I'm also not including the XML Schema flavor; it's much more limited than JavaScript, but it's a specialized application. For example, it doesn't support the anchors (^, $, \A, \Z, etc.) because matches are always anchored at both ends.
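The implicit anchoring of XML Schema patterns can be approximated in a Perl-derived flavor. The Python sketch below uses `re.fullmatch` as a stand-in for that behavior; it is an analogy rather than an XML Schema implementation, and the pattern and inputs are invented for the example.

```python
import re

pattern = r"\d{4}"

# XML Schema style: the pattern has to cover the whole value.
whole_value = re.fullmatch(pattern, "2019") is not None
embedded_value = re.fullmatch(pattern, "x2019y") is not None

# Perl-derived default: an unanchored search finds the digits anywhere.
found_anywhere = re.search(pattern, "x2019y") is not None

print(whole_value, embedded_value, found_anywhere)
```

A pattern written for XML Schema therefore needs explicit anchors, or a whole-string match function like the one above, before it behaves the same way elsewhere.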



Answer 2:

Here you can find a good reference. And here you have the best book I have ever read about the subject. Then in this page, under language features (Part 1 & 2), you can see some differences.