awk: fatal: Invalid regular expression when settin

2020-03-08 03:44发布

问题:

I was trying to solve Grep regex to select only 10 character using awk. The question consists in a string XXXXXX[YYYYY--ZZZZZ and the OP wants to print the text in between the unique [ and -- strings within the text.

If it was just one - I would say use [-[] as field separator (FS). This is setting the FS to be either - or [:

$ echo "XXXXXXX[YYYYY-ZZZZ" | awk -F[-[] '{print $2}'
YYYYY

The tricky point is that [ has also a special meaning as a character class, so that to make it be correctly interpreted as one of the possible FS it cannot be written in the first position. Well, this is done by saying [-[]. So we are done to match either - or [.

However, in this case it is not one but two hyphens: I want to say either -- or [. I cannot say [--[] because the hyphen also has a meaning to define a range.

What I can do is to use -F"one pattern|another pattern" like:

$ echo "XXXXXXXaaYYYYYbbZZZZ" | awk -F"aa|bb" '{print $2}'
YYYYY

So if I try to use this with -- and [, I cannot get a proper result:

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F"--|[" '{print $2}'
awk: fatal: Invalid regular expression: /--|[/

And in fact, not even having [ as one of the terms:

$ echo "XXXXXXX[YYYYYbbZZZZ" | awk -F"bb|[" '{print $2}'
awk: fatal: Invalid regular expression: /bb|[/

$ echo "XXXXXXX[YYYYYbbZZZZ" | awk -F"bb|\[" '{print $2}'
awk: warning: escape sequence `\[' treated as plain `['
awk: fatal: Invalid regular expression: /bb|[/

$ echo "XXXXXXX[YYYYYbbZZZZ" | awk -F"(bb|\[)" '{print $2}'
awk: warning: escape sequence `\[' treated as plain `['
awk: fatal: Unmatched [ or [^: /(bb|[)/

You see I tried to either escaping [, enclosing in parentheses and nothing worked.

So: what can I do to set the field separator to either -- or [? Is it possible at all?

回答1:

IMHO this is best explained if we start by looking at a regexp being used by the split() command since that explicitly shows what is happening when a string is split into fields using a literal vs dynamic regexp and then we can relate that to Field Separators.

This uses a literal regexp (delimited by /s):

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk '{split($0,f,/\[|--/); print f[2]}'
YYYYY

and so requires the [ to be escaped so it is taken literally since [ is a regexp metacharacter.

These use a dynamic regexp (one stored as a string):

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk '{split($0,f,"\\[|--"); print f[2]}'
YYYYY

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk 'BEGIN{re="\\[|--"} {split($0,f,re); print f[2]}'
YYYYY

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -v re='\\[|--' '{split($0,f,re); print f[2]}'
YYYYY

and so require the [ to be escaped 2 times since awk has to convert the string holding the regexp (a variable named re in the last 2 examples) to a regexp (which uses up one backslash) before it's used as the separator in the split() call (which uses up the second backslash).

This:

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -v re="\\\[|--" '{split($0,f,re); print f[2]}'
YYYYY

exposes the variable contents to the shell for it's evaluation and so requires the [ to be escaped 3 times since the shell parses the string first to try to expand shell variables etc. (which uses up one backslash) and then awk has to convert the string holding the regexp to a regexp (which uses up a second backslash) before it's used as the separator in the split() call (which uses up the third backslash).

A Field Separator is just a regexp stored as variable named FS (like re above) with some extra semantics so all of the above applies to it to, hence:

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F '\\[|--' '{print $2}'
YYYYY

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F "\\\[|--" '{print $2}'
YYYYY

Note that we could have used a bracket expression instead of escaping it to have the [ treated literally:

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk '{split($0,f,/[[]|--/); print f[2]}'
YYYYY

and then we don't have to worry about escaping the escapes as we add layers of parsing:

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F "[[]|--" '{print $2}'
YYYYY

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F '[[]|--' '{print $2}'
YYYYY


回答2:

You need to use double backslash for escaping regex meta chars inside double quoted string so that it would be treated as regex meta character otherwise (if you use single backslash) it would be treated as ecape sequence.

$ echo 'XXXXXXX[YYYYYbbZZZZ' | awk -v FS="bb|\\[" '{print $2}'
YYYYY


回答3:

This with GNU Awk 3.1.7

echo "XXXXXXX[YYYYY--ZZZZ" | awk -F"--|[[]" '{print $2}'    
echo "XXXXXXX[YYYYYbbZZZZ" | awk -F"bb|[[]" '{print $2}'


标签: regex awk gawk