What is the behaviour of FS = “ ” in GNU Awk 4.2?

Posted 2019-07-19 15:57

Question:

In the first week of October, Arnold Robbins announced on the GNU-announce, bug-gawk and comp.lang.awk mailing lists that a beta release of gawk 4.2.0 is now available. It can be downloaded from http://www.skeeve.com/gawk/gawk-4.1.65.tar.gz [1], and he mentions that this is a major release, with many significant new features.

So I went through the NEWS file to dig into these features and stopped at this point to run some tests:

Changes from 4.1.4 to 4.2.0

...

  1. Revisions in the POSIX standard remove the special case for POSIX mode when FS = " " where newline was not a field separator. The code and doc have been updated.

If I understand correctly, this refers to the GNU Awk User's Guide → 4.5.2 Using Regular Expressions to Separate Fields:

There is an important difference between the two cases of ‘FS = " "’ (a single space) and ‘FS = "[ \t\n]+"’ (a regular expression matching one or more spaces, TABs, or newlines). For both values of FS, fields are separated by runs (multiple adjacent occurrences) of spaces, TABs, and/or newlines. However, when the value of FS is " ", awk first strips leading and trailing whitespace from the record and then decides where the fields are.

That is, it concerns the difference between using FS = " " and FS = "[ \t\n]+".

I ran a test with the new version in --posix mode:
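To make the quoted difference concrete, here is a minimal sketch that runs the same record through both settings. The sample input with leading and trailing spaces is made up for illustration, and it uses the default RS, so it does not touch the newline question at all:

```shell
# Same record, two FS settings; the input has leading and trailing spaces.
nf_space=$(printf '  a b  \n' | awk 'BEGIN { FS = " " } { print NF }')
nf_regex=$(printf '  a b  \n' | awk 'BEGIN { FS = "[ \t\n]+" } { print NF }')

# With FS = " " the surrounding whitespace is stripped first: 2 fields.
printf 'single-space FS: NF=%s\n' "$nf_space"
# With the regex, the leading and trailing runs delimit empty fields: 4 fields.
printf 'regex FS:        NF=%s\n' "$nf_regex"
```

The two extra fields in the regex case are the empty strings before the leading run and after the trailing run of spaces.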

$ ./gawk --posix -F" " '{print "NR:", NR; for(i=1;i<=NF;i++) print i, $i}' <<< "hello how are
you"
NR: 1
1 hello
2 how
3 are
NR: 2
1 you

Then I compared it with my previous gawk (4.1.3) and could not see any difference:

$ gawk --posix -F" " '{print "NR:", NR; for(i=1;i<=NF;i++) print i, $i}' <<< "hello how are
you"
NR: 1
1 hello
2 how
3 are
NR: 2
1 you

All in all, my question is: what is the difference in the behaviour of FS = " " in --posix mode for GNU Awk 4.2? What exactly has changed?

[1] Yes, I also thought it should be 4.2.tar.gz, but http://www.skeeve.com/gawk/gawk-4.2.tar.gz does not exist.

Answer 1:

It's a beta release for 4.2 so it's built/named off 4.1. When it's official THEN it'll be 4.2.tar.gz.

I don't have the 4.2 beta handy to test the following theory, but here's what I think the announcement means with respect to the default FS=" ":

Previously in POSIX, setting FS=" " meant that fields were separated by all whitespace characters except newline. gawk, on the other hand, included newline as one of the separators by default, and you had to add --posix to get the POSIX behavior. Look:

$ gawk --version
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.5, GNU MP 6.1.2)

$ printf 'a b\nc' | awk -v RS='^$' 'NR==1{for (i=1; i<=NF;i++) print NR, NF, i, "<" $i ">"}'
1 3 1 <a>
1 3 2 <b>
1 3 3 <c>

$ printf 'a b\nc' | awk --posix -v RS='^$' 'NR==1{for (i=1; i<=NF;i++) print NR, NF, i, "<" $i ">"}'
1 2 1 <a>
1 2 2 <b
c>

Apparently the POSIX standard has now been updated to include \n in the set of separator characters when FS=" ", so gawk no longer needs to behave differently in that respect in POSIX vs. non-POSIX mode. Instead, all POSIX awks need to be updated to behave as gawk did by default all along.

The example in your question doesn't test that because it uses \n as the RS (the default), so a \n can never appear within a record. Try it again after setting RS="^$".
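A rough sketch of such a retest, using the input from the question. As an assumption for portability it uses a single-character RS=";" rather than the gawk-specific RS="^$"; either way the newline ends up inside the record. With gawk, the comparison between versions would be made by adding --posix to the awk call:

```shell
# RS=";" is an illustrative stand-in: the whole input becomes one record,
# so the newline sits inside it and FS=" " has to deal with it.
nf=$(printf 'hello how are\nyou;' | awk -v RS=';' '{ print NF }')

# An awk that treats newline as a separator for FS=" " reports 4 fields.
printf 'NF=%s\n' "$nf"
```

If newline were not treated as a separator, as in the 4.1.4 --posix run above, "are" and "you" would stay together in one field and NF would be 3.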



Tags: awk posix gnu gawk