Weird behavior of BASH glob/regex ranges

I'm seeing BASH bracket ranges (e.g. [A-Z]) behaving in an unexpected way.
Is there an explanation for this behavior, or is it a bug?

Let's say I have a variable, from which I want to strip all uppercase letters:

$ var='ABCDabcd0123'
$ echo "${var//[A-Z]/}"

The result I get is this:

a0123

If I do it with sed, I get the expected result:

$ echo "${var}" | sed 's/[A-Z]//g'
abcd0123

The same seems to happen with BASH's built-in regex match:

$ [[ a =~ [A-Z] ]] ; echo $?
1
$ [[ b =~ [A-Z] ]] ; echo $?
0

If I check all lowercase letters from 'a' to 'z', it seems that only 'a' is an exception:

$ for l in {a..z}; do [[ $l =~ [A-Z] ]] || echo $l; done
a

I do not have case-insensitive matching enabled, and even if I did, it should not make letter 'a' behave differently:

$ shopt -p nocasematch
shopt -u nocasematch

For reference, I'm using Cygwin, and I don't see this behavior on any other machine:

$ uname
CYGWIN_NT-6.3
$ bash --version | head -1
GNU bash, version 4.3.46(7)-release (x86_64-unknown-cygwin)
$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_ALL=

EDIT:

I've found the exact same issue reported here: https://bugs.launchpad.net/ubuntu/+source/bash/+bug/120687
So I guess it's a bug(?) in the "en_GB.UTF-8" collation rather than in BASH itself.
Setting LC_COLLATE=C indeed solves this.

Tags: regex bash shell cygwin glob

1 answer

孤傲高冷的网名

#2 · 2019-02-25 01:43

It certainly has to do with your locale setting. An excerpt from the GNU bash man page, under Pattern Matching:

[..] in the default C locale, [a-dx-z] is equivalent to [abcdxyz]. Many locales sort characters in dictionary order, and in these locales [a-dx-z] is typically not equivalent to [abcdxyz]; it might be equivalent to [aBbCcDdxXyYz], for example. To obtain the traditional interpretation of ranges in bracket expressions, you can force the use of the C locale by setting the LC_COLLATE or LC_ALL environment variable to the value C, or enable the globasciiranges shell option.[..]
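The globasciiranges option named at the end of that excerpt can be tried as in the minimal sketch below. It assumes bash 4.3 or later, where the option exists. As far as I can tell it only affects shell pattern matching (globs, ${var//pattern/}, [[ x == pattern ]]) and not the =~ operator, which hands the pattern to the system regex library.

```shell
#!/usr/bin/env bash
# Make ranges in shell patterns use the traditional C ordering,
# whatever LC_COLLATE happens to be (bash 4.3+).
shopt -s globasciiranges

var='ABCDabcd0123'
stripped=${var//[A-Z]/}   # remove every character in the range A-Z
echo "$stripped"
```

With the option enabled this should print abcd0123 without touching the locale variables.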

Use the POSIX character classes ([[:upper:]] in this case), or set LC_ALL or LC_COLLATE to C as mentioned above.

LC_ALL=C var='ABCDabcd0123'
echo "${var//[A-Z]/}"
abcd0123
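The assignment above changes LC_ALL for the rest of the shell session. If you only want the override around one operation, a function whose body is a subshell keeps it contained. This is just a sketch, and strip_upper is a made-up helper name, not something bash provides.

```shell
#!/usr/bin/env bash
# strip_upper: hypothetical helper. The body is a subshell ( ... ),
# so the LC_ALL assignment never reaches the calling shell.
strip_upper() (
    LC_ALL=C                        # C collation inside the subshell only
    printf '%s\n' "${1//[A-Z]/}"    # strip A-Z using ASCII ordering
)

strip_upper 'ABCDabcd0123'
```

The caller's locale stays as it was, at the cost of one fork per call.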

Likewise, in the C locale your uppercase check fails for every lowercase letter, so the loop prints all of them:

LC_ALL=C; for l in {a..z}; do [[ $l =~ [A-Z] ]] || echo $l; done

Under the same locale setting, neither of these matches:

[[ a =~ [A-Z] ]] ; echo $?
1
[[ b =~ [A-Z] ]] ; echo $?
1

whereas the lowercase range matches both:

[[ a =~ [a-z] ]] ; echo $?
0
[[ b =~ [a-z] ]] ; echo $?
0

That said, all of this can be avoided by using the POSIX character classes, which work in a fresh shell without any locale override:

echo "${var//[[:upper:]]/}"
abcd0123

and

for l in {a..z}; do [[ $l =~ [[:upper:]] ]] || echo $l; done
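To check quickly whether a given machine is affected, the loop from the question can be turned around to collect the lowercase letters that [A-Z] does match. This is a rough sketch: leaked_lowercase is a hypothetical name, and it assumes nocasematch is off.

```shell
#!/usr/bin/env bash
# leaked_lowercase: hypothetical diagnostic helper.
# Prints the lowercase letters that the regex range [A-Z] matches in the
# active locale. An empty line means the range behaves the ASCII way.
leaked_lowercase() {
    local l out=''
    for l in {a..z}; do
        if [[ $l =~ [A-Z] ]]; then
            out+=$l
        fi
    done
    printf '%s\n' "$out"
}

leaked_lowercase                  # result depends on the active locale
( LC_ALL=C; leaked_lowercase )    # C locale: prints an empty line
```

Anything other than an empty line from the first call suggests the locale's collation order is interleaving upper and lower case.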
