Regex to match specific functions and their arguments

Posted 2020-03-31 02:57

Question:

I'm working on a gettext javascript parser and I'm stuck on the parsing regex.

I need to capture every argument passed to the function calls _n( and _(. For example, if I have these in my JavaScript files:

_("foo") // want "foo"
_n("bar", "baz", 42); // want "bar", "baz", 42
_n(domain, "bux", var); // want domain, "bux", var
_( "one (optional)" ); // want "one (optional)"
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) // could have on the same line two calls.. 

This refers to the following documentation: http://poedit.net/trac/wiki/Doc/Keywords

I'm planning to do it in two passes (with two regexes):

  1. catch all function arguments for _n( or _( method calls
  2. catch the stringy ones only

Basically, I'd like a regex that says: "capture everything after _n( or _( and stop at the closing parenthesis that actually ends the function call". I don't know whether this is possible with a regex and without a JavaScript parser.

Another option would be: "capture every "string" or 'string' after _n( or _( and stop at the end of the line OR at the start of the next _n( or _(".

Everything I've tried gets stuck either on _( "one (optional)" ); with its inner parentheses, or on apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) with two calls on the same line.

Here is what I have implemented so far, with imperfect regexes: a generic parser and the javascript one or the handlebars one

Answer 1:

Note: Read this answer if you're not familiar with recursion.

Part 1: match specific functions

Who said that regex can't be modular? Well PCRE regex to the rescue!

~                      # Delimiter
(?(DEFINE)             # Start of definitions
   (?P<str_double_quotes>
      (?<!\\)          # Not escaped
      "                # Match a double quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      "                # Match the ending double quote
   )

   (?P<str_single_quotes>
      (?<!\\)          # Not escaped
      '                # Match a single quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      '                # Match the ending single quote
   )

   (?P<brackets>
      \(                          # Match an opening bracket
         (?:                      # A non capturing group
            (?&str_double_quotes) # Recurse/use the str_double_quotes pattern
            |                     # Or
            (?&str_single_quotes) # Recurse/use the str_single_quotes pattern
            |                     # Or
            [^()]                 # Anything not a bracket
            |                     # Or
            (?&brackets)          # Recurse the bracket pattern
         )*
      \)
   )
)                                 # End of definitions
# Let's start matching for real now:
_n?                               # Match _ or _n
\s*                               # Optional white spaces
(?P<results>(?&brackets))         # Recurse/use the brackets pattern and put it in the results group
~sx

The s modifier makes . match newlines, and the x modifier allows the free spacing and comments used in the regex.
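
For readers without PCRE at hand, the same idea can be sketched in Python. Python's standard re module does not support recursion such as (?&brackets), so this sketch replaces the recursive part with a small hand-written scanner. The names find_calls, match_brackets and skip_string are made up for illustration, and the scanner is only meant to behave like the pattern above on inputs like the ones in the question, not to be an exact equivalent.

```python
import re

# Entry point, same as `_n? \s* \(` in the pattern above.
CALL_START = re.compile(r"_n?\s*\(")


def skip_string(text, i):
    """Return the index just past the string literal that starts at text[i]."""
    quote = text[i]
    i += 1
    while i < len(text):
        if text[i] == "\\":      # a backslash escapes the next character
            i += 2
        elif text[i] == quote:   # unescaped closing quote
            return i + 1
        else:
            i += 1
    return None                  # unterminated string


def match_brackets(text, i):
    """Return the index just past the ')' that balances the '(' at text[i]."""
    depth = 0
    while i is not None and i < len(text):
        ch = text[i]
        if ch in "\"'":
            i = skip_string(text, i)   # brackets inside strings do not count
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    return None                  # unbalanced brackets


def find_calls(source):
    """Collect the bracketed argument list of every _( ... ) or _n( ... ) call."""
    results, pos = [], 0
    while True:
        m = CALL_START.search(source, pos)
        if m is None:
            break
        open_paren = m.end() - 1
        end = match_brackets(source, open_paren)
        if end is None:          # give up on this call, keep scanning
            pos = m.end()
            continue
        results.append(source[open_paren:end])
        pos = end
    return results
```

Calling find_calls('_( "one (optional)" );') returns ['( "one (optional)" )'], brackets included, which is what the results group holds in the PCRE version.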

Online regex demo Online php demo

Part 2: getting rid of opening & closing brackets

Since our regex also captures the opening and closing brackets (), we may want to strip them. We will use preg_replace() on the results:

~           # Delimiter
^           # Assert begin of string
\(          # Match an opening bracket
\s*         # Match optional whitespaces
|           # Or
\s*         # Match optional whitespaces
\)          # Match a closing bracket
$           # Assert end of string
~x

Online php demo
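
The same stripping step can be tried outside PHP. Below is a minimal Python sketch that uses the pattern above without the comments; the helper name strip_outer_brackets is an assumption made for this example.

```python
import re

# An opening bracket at the very start, or a closing bracket at the very end,
# each together with the whitespace next to it.
OUTER_BRACKETS = re.compile(r"^\(\s*|\s*\)$")


def strip_outer_brackets(group):
    """Remove only the outermost brackets; inner ones are left alone."""
    return OUTER_BRACKETS.sub("", group)
```

For example, strip_outer_brackets('( "one (optional)" )') gives '"one (optional)"': the anchors ^ and $ keep the brackets inside the string untouched.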

Part 3: extracting the arguments

Here's another modular regex; you could even extend it with your own grammar:

~                      # Delimiter
(?(DEFINE)             # Start of definitions
   (?P<str_double_quotes>
      (?<!\\)          # Not escaped
      "                # Match a double quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      "                # Match the ending double quote
   )

   (?P<str_single_quotes>
      (?<!\\)          # Not escaped
      '                # Match a single quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      '                # Match the ending single quote
   )

   (?P<array>
      Array\s*
      (?&brackets)
   )

   (?P<variable>
      [^\s,()]+        # I don't know the exact grammar for a variable in ECMAScript
   )

   (?P<brackets>
      \(                          # Match an opening bracket
         (?:                      # A non capturing group
            (?&str_double_quotes) # Recurse/use the str_double_quotes pattern
            |                     # Or
            (?&str_single_quotes) # Recurse/use the str_single_quotes pattern
            |                     # Or
            (?&array)             # Recurse/use the array pattern
            |                     # Or
            (?&variable)          # Recurse/use the variable pattern
            |                     # Or
            [^()]                 # Anything not a bracket
            |                     # Or
            (?&brackets)          # Recurse the bracket pattern
         )*
      \)
   )
)                                 # End of definitions
# Let's start matching for real now:
(?&array)
|
(?&str_double_quotes)
|
(?&str_single_quotes)
|
(?&variable)
~xis
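
A reduced version of this argument pattern can be sketched in Python. It is only an approximation: Python's re module has no recursion, so the Array(...) branch is left out and only strings and bare tokens are recognised. The names ARGUMENT and split_arguments are invented for this sketch.

```python
import re

# Strings are tried before bare tokens, as in the final PHP code.
ARGUMENT = re.compile(r"""
    "(?:[^"\\]|\\.)*"     # double-quoted string, escaped characters allowed
  | '(?:[^'\\]|\\.)*'     # single-quoted string, escaped characters allowed
  | [^\s,()]+             # bare token such as a variable or a number
""", re.VERBOSE | re.DOTALL)


def split_arguments(arguments):
    """Return each argument of an already-unwrapped argument list."""
    return ARGUMENT.findall(arguments)
```

The order of the alternatives matters. If the bare-token branch came first, it would match the start of '"one (optional)"' up to the first space and break the string apart, because a quote is also "not whitespace, comma or bracket".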

We will loop and use preg_match_all(). The final code would look like this:

$functionPattern = <<<'regex'
~                      # Delimiter
(?(DEFINE)             # Start of definitions
   (?P<str_double_quotes>
      (?<!\\)          # Not escaped
      "                # Match a double quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      "                # Match the ending double quote
   )

   (?P<str_single_quotes>
      (?<!\\)          # Not escaped
      '                # Match a single quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      '                # Match the ending single quote
   )

   (?P<brackets>
      \(                          # Match an opening bracket
         (?:                      # A non capturing group
            (?&str_double_quotes) # Recurse/use the str_double_quotes pattern
            |                     # Or
            (?&str_single_quotes) # Recurse/use the str_single_quotes pattern
            |                     # Or
            [^()]                 # Anything not a bracket
            |                     # Or
            (?&brackets)          # Recurse the bracket pattern
         )*
      \)
   )
)                                 # End of definitions
# Let's start matching for real now:
_n?                               # Match _ or _n
\s*                               # Optional white spaces
(?P<results>(?&brackets))         # Recurse/use the brackets pattern and put it in the results group
~sx
regex;


$argumentsPattern = <<<'regex'
~                      # Delimiter
(?(DEFINE)             # Start of definitions
   (?P<str_double_quotes>
      (?<!\\)          # Not escaped
      "                # Match a double quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      "                # Match the ending double quote
   )

   (?P<str_single_quotes>
      (?<!\\)          # Not escaped
      '                # Match a single quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      '                # Match the ending single quote
   )

   (?P<array>
      Array\s*
      (?&brackets)
   )

   (?P<variable>
      [^\s,()]+        # I don't know the exact grammar for a variable in ECMAScript
   )

   (?P<brackets>
      \(                          # Match an opening bracket
         (?:                      # A non capturing group
            (?&str_double_quotes) # Recurse/use the str_double_quotes pattern
            |                     # Or
            (?&str_single_quotes) # Recurse/use the str_single_quotes pattern
            |                     # Or
            (?&array)             # Recurse/use the array pattern
            |                     # Or
            (?&variable)          # Recurse/use the variable pattern
            |                     # Or
            [^()]                 # Anything not a bracket
            |                     # Or
            (?&brackets)          # Recurse the bracket pattern
         )*
      \)
   )
)                                 # End of definitions
# Let's start matching for real now:
(?&array)
|
(?&str_double_quotes)
|
(?&str_single_quotes)
|
(?&variable)
~six
regex;

$input = <<<'input'
_  ("foo") // want "foo"
_n("bar", "baz", 42); // want "bar", "baz", 42
_n(domain, "bux", var); // want domain, "bux", var
_( "one (optional)" ); // want "one (optional)"
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) // could have on the same line two calls..

// misleading cases
_n("foo (")
_n("foo (\)", 'foo)', aa)
_n( Array(1, 2, 3), Array(")",   '(')   );
_n(function(foo){return foo*2;}); // Is this even valid?
_n   ();   // Empty
_ (   
    "Foo",
    'Bar',
    Array(
        "wow",
        "much",
        'whitespaces'
    ),
    multiline
); // PCRE is awesome
input;

if(preg_match_all($functionPattern, $input, $m)){
    $filtered = preg_replace(
        '~          # Delimiter
        ^           # Assert begin of string
        \(          # Match an opening bracket
        \s*         # Match optional whitespaces
        |           # Or
        \s*         # Match optional whitespaces
        \)          # Match a closing bracket
        $           # Assert end of string
        ~x', // Regex
        '', // Replace with nothing
        $m['results'] // Subject
    ); // Getting rid of opening & closing brackets

    // Part 3: extract arguments:
    $parsedTree = array();
    foreach($filtered as $arguments){   // Loop
        if(preg_match_all($argumentsPattern, $arguments, $m)){ // If there's a match
            $parsedTree[] = array(
                'all_arguments' => $arguments,
                'branches' => $m[0]
            ); // Add an array to our tree and fill it
        }else{
            $parsedTree[] = array(
                'all_arguments' => $arguments,
                'branches' => array()
            ); // Add an array with empty branches
        }
    }

    print_r($parsedTree); // Let's see the results;
}else{
    echo 'no matches';
}

Online php demo

You might want to create a recursive function to generate a full tree. See this answer.

You might notice that the function(){} part isn't parsed correctly. I will leave that as an exercise for the reader :)



Answer 2:

Try this:

(?<=\().*?(?=\s*\)[^)]*$)

See live demo



Answer 3:

The regex below should help you.

^(?=\w+\()\w+?\(([\s'!\\\)",\w]+)+\);

Check the demo here



Answer 4:

\(( |"(\\"|[^"])*"|'(\\'|[^'])*'|[^)"'])*?\)

This should get anything between a pair of parentheses, ignoring parentheses in quotes. Explanation:

\( // Literal open paren
    (
         | //Space or
        "(\\"|[^"])*"| //Anything between two double quotes, including escaped quotes, or
        '(\\'|[^'])*'| //Anything between two single quotes, including escaped quotes, or
        [^)"'] //Any character that isn't a quote or close paren
    )*? // All that, as many times as necessary
\) // Literal close paren
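
To see how this pattern behaves, here is a small Python sketch that applies it unchanged; the wrapper name paren_groups is an assumption made for the example.

```python
import re

# The pattern from above, copied as-is.
PAREN_GROUP = re.compile(r"""\(( |"(\\"|[^"])*"|'(\\'|[^'])*'|[^)"'])*?\)""")


def paren_groups(source):
    """Return every parenthesised group, outer parentheses included."""
    return [m.group(0) for m in PAREN_GROUP.finditer(source)]
```

paren_groups('_( "one (optional)" );') returns ['( "one (optional)" )'], and paren_groups('a = (b * c)') returns ['(b * c)'], so plain grouping parentheses are matched as well as function calls.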

No matter how you slice it, regular expressions are going to cause problems. They're hard to read, hard to maintain, and highly inefficient. I'm unfamiliar with gettext, but perhaps you could use a for loop?

// A Java sketch of the idea. A loop like this can be more readable, maintainable, and predictable than a regular expression.
import java.util.ArrayList;
import java.util.List;

class ParenScanner {
    static List<String> captureAll(String input) {
        List<String> captures = new ArrayList<>();
        for (int i = 0; i < input.length(); i++) {
            // Ignore anything that isn't an opening paren
            if (input.charAt(i) != '(') continue;
            StringBuilder captured = new StringBuilder();
            // Loop until a close paren or the end of the input is reached
            for (i++; i < input.length() && input.charAt(i) != ')'; i++) {
                char c = input.charAt(i);
                captured.append(c);
                if (c != '"' && c != '\'') continue;
                // Inside a string: copy until the unescaped closing quote or the end of the input
                for (i++; i < input.length(); i++) {
                    char s = input.charAt(i);
                    captured.append(s);
                    if (s == c) break;                       // closing quote
                    if (s == '\\' && i + 1 < input.length()) {
                        captured.append(input.charAt(++i));  // escaped character, copied verbatim
                    }
                }
            }
            captures.add(captured.toString());
        }
        return captures;
    }
}

Note: I didn't cover how to determine whether a parenthesis belongs to a function call or is just a grouping symbol (i.e., this will also match a = (b * c)). That's complicated, as is covered in detail here. As your code gets more and more accurate, you get closer and closer to writing your own JavaScript parser. You might want to take a look at the source code of actual JavaScript parsers if you need that sort of accuracy.



Answer 5:

One bit of code (you can test this PHP code at http://writecodeonline.com/php/ to check):

$string = '_("foo")
_n("bar", "baz", 42); 
_n(domain, "bux", var);
_( "one (optional)" );
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples)';

preg_match_all('/(?<=(_\()|(_n\())[\w", ()%]+(?=\))/i', $string, $matches);

foreach($matches[0] as $test){
    $opArr = explode(',', $test);
    foreach($opArr as $test2){
        echo trim($test2) . "\n";
    }
}

You can see the initial pattern and how it works here: http://regex101.com/r/fR7eU2/1

Output is:

"foo"
"bar"
"baz"
42
domain
"bux"
var
"one (optional)"
"No apples"
"%1 apple"
"%1 apples"
apples
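
The approach can also be sketched in Python. Note one deliberate change: Python's re module only accepts fixed-width lookbehinds, so this sketch uses a capture group instead of the lookbehind from the PHP pattern. The name list_arguments is invented for the example.

```python
import re

# Assumption: `_n?\(` plus a capture group stands in for the PHP lookbehind
# (?<=(_\()|(_n\()), which Python's `re` would reject as variable-width.
CALL_ARGS = re.compile(r'_n?\(([\w", ()%]+)\)', re.IGNORECASE)


def list_arguments(source):
    """Return every argument of every _( ... ) / _n( ... ) call, in order."""
    arguments = []
    for call in CALL_ARGS.finditer(source):
        # Naive split on commas, as explode(',') does in the PHP version.
        arguments.extend(part.strip() for part in call.group(1).split(","))
    return arguments
```

On the five sample lines this yields the same twelve values as the output listed above. The comma split is the weak spot: list_arguments('_("a, b")') returns ['"a', 'b"'], so strings that contain commas are cut in two, and the character class also rejects any character it does not list, such as a single quote.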


Answer 6:

We can do this in two steps:

1) Catch all function arguments for _n( or _( method calls

(?:_\(|_n\()(?:[^()]*\([^()]*\))*[^()]*\)

See demo.

http://regex101.com/r/oE6jJ1/13

2) Catch the stringy ones only

"([^"]*)"|(?:\(|,)\s*([^"),]*)(?=,|\))

See demo.

http://regex101.com/r/oE6jJ1/14
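
The two steps can be chained as in the following Python sketch. Both patterns are copied from above; the function name extract and the way the two capture groups are merged are assumptions made for this example.

```python
import re

# Step 1: a _( or _n( call, allowing one level of nested parentheses.
CALL = re.compile(r'(?:_\(|_n\()(?:[^()]*\([^()]*\))*[^()]*\)')
# Step 2: a double-quoted string (group 1) or a bare argument (group 2).
ARGUMENT = re.compile(r'"([^"]*)"|(?:\(|,)\s*([^"),]*)(?=,|\))')


def extract(source):
    """Return one list of arguments per matched call."""
    calls = []
    for call in CALL.finditer(source):
        args = [m.group(1) if m.group(1) is not None else m.group(2)
                for m in ARGUMENT.finditer(call.group(0))]
        calls.append(args)
    return calls
```

With extract('_n(domain, "bux", var);') the result is [['domain', 'bux', 'var']]. Strings come back without their quotes, so to keep "the stringy ones only" you would keep just the matches where group 1 is set. Since step 1 does not look inside quotes, a call such as _n("foo (") with an unbalanced parenthesis in a string is not matched.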