To parse colon-delimited fields I can use read with a custom IFS:
$ echo 'foo.c:41:switch (color) {' | { IFS=: read file line text && echo "$file | $line | $text"; }
foo.c | 41 | switch (color) {
If the last field contains colons, no problem, the colons are retained.
$ echo 'foo.c:42:case RED: //alert' | { IFS=: read file line text && echo "$file | $line | $text"; }
foo.c | 42 | case RED: //alert
A trailing delimiter is also retained...
$ echo 'foo.c:42:case RED: //alert:' | { IFS=: read file line text && echo "$file | $line | $text"; }
foo.c | 42 | case RED: //alert:
...Unless it's the only extra delimiter. Then it's stripped. Wait, what?
$ echo 'foo.c:42:case RED:' | { IFS=: read file line text && echo "$file | $line | $text"; }
foo.c | 42 | case RED
Bash, ksh93, and dash all do this, so I'm guessing it is POSIX standard behavior.
- Why does it happen?
- What's the best alternative?
I want to parse the strings above into three variables and I don't want to mangle any text in the third field. I had thought read was the way to go, but now I'm reconsidering.
One "feature" of read is that it will strip leading and trailing whitespace separators in the variables it populates; this is explained in much more detail at the linked answer. This enables beginners to have read do what they expect when doing, for example, read first rest <<< ' foo bar ' (note the extra spaces).
The take-away? It is hard to do accurate text processing using Bash and shell tools. If you want full control, it's probably better to use a "stricter" language such as Python, where split() will do what you want, but where you might have to dig much deeper into string handling to explicitly remove newline separators or handle encoding.
Yes, that's standard behaviour (see the read specification and Field Splitting). A few shells (ash-based including dash, pdksh-based, zsh, yash at least) used not to do it, but except for zsh (when not in POSIX mode) and busybox sh, most of them have been updated for POSIX compliance.
The same goes for field splitting in general.
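A minimal sketch of that behaviour with plain field splitting, assuming a POSIX-compliant sh (current bash, dash or ksh93, for instance). The count_fields helper is a hypothetical name introduced here for illustration only:

```shell
# count_fields STRING
# Prints how many fields STRING yields when split on ":".
# The subshell keeps the IFS and noglob changes local.
count_fields() {
  (
    IFS=:
    set -f          # disable globbing so only field splitting applies
    set -- $1       # unquoted on purpose: this is where splitting happens
    echo "$#"
  )
}

count_fields 'a:b:c:'    # trailing ":" ends the third field, no fourth one
count_fields 'a:b:c::'   # the second trailing ":" delimits an empty field
```

In a compliant shell the first call should report 3 fields and the second 4; zsh outside of sh emulation and busybox sh may behave differently.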
(See how the POSIX specification for read actually defers to the Field Splitting mechanism, where a:b:c: is split into 3 fields, so with IFS=: read -r a b c there are as many fields as variables.)
The rationale is that in ksh (on which the POSIX spec is based), $IFS (initially, in the Bourne shell, the internal field separator) became a field delimiter, I think so that any list of elements (not containing the delimiter) could be represented.
When $IFS is a separator, one can't represent a list of one empty element ("" is split into a list of 0 elements, ":" into a list of two empty elements¹). When it's a delimiter, you can express a list of zero elements with "", one empty element with ":", or two empty elements with "::".
It's a bit unfortunate, as one of the most common usages of $IFS is to split $PATH. A $PATH like /bin:/usr/bin: is meant to be split into "/bin", "/usr/bin", "", not just "/bin" and "/usr/bin".
Now, with POSIX shells (but not all shells are compliant in that regard), for word splitting upon parameter expansion, that can be worked around by splitting $PATH"" instead of $PATH.
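A sketch of that workaround, assuming a shell that handles the appended quoted empty string as POSIX specifies. The split_path helper and the sample value are hypothetical, chosen only to keep the example self-contained:

```shell
# split_path VALUE
# Prints each element of a PATH-like VALUE in brackets, one per line.
# The subshell keeps the IFS and noglob changes local.
split_path() {
  (
    IFS=:
    set -f                    # no globbing on the split words
    for dir in $1""; do       # the "" keeps a trailing empty element
      printf '[%s]\n' "$dir"
    done
  )
}

split_path '/bin:/usr/bin:'   # sample value with a trailing ":"
```

With the appended "", the sample value should yield three elements, the last one empty; dropping the "" would leave only the first two.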
That trailing "" makes sure that if $PATH ends in a trailing :, an extra empty element is added, and also that an empty $PATH is treated as one empty element, as it should be.
That approach can't be used for read though.
Short of switching to zsh, there's no easy workaround other than inserting an extra : and removing it afterwards, which can be done portably or, with shell-specific syntax, less portably.
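One way this could look, as a sketch rather than a definitive recipe. It feeds read from a here-document so that it stays within POSIX sh; the variable names follow the question, and the input value is one of its sample records:

```shell
input='foo.c:42:case RED:'    # sample record from the question

# Append one extra ":" so that the added colon, not the real trailing
# one, is what read treats as the final field delimiter.
IFS=: read -r file line text <<EOF
$input:
EOF
text=${text%:}                # drop the added colon if it survived

printf '%s | %s | %s\n' "$file" "$line" "$text"
```

A here-string (read ... <<< "$input:") would be a shorter way to supply the input in bash, ksh93 or zsh, but dash does not support that syntax.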
I've also added the -r, which you generally want when using read.
Most likely here you'd want to use a proper text processing utility like sed/awk/perl instead of writing convoluted and probably inefficient code around read, which has not been designed for that.
¹ Though in the Bourne shell, that was still split into zero elements, as there was no distinction between IFS-whitespace and IFS-non-whitespace characters there, something that was also added by ksh.
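As a rough sketch of the text-utility route suggested above, assuming a POSIX awk is available: the record is cut at the first two colons by hand, so everything after them is kept verbatim. The split_record helper and its " | " output format are hypothetical, mirroring the question's examples:

```shell
# split_record RECORD
# Prints "file | line | text", where text is everything after the
# second ":" in RECORD, trailing colons included.
split_record() {
  printf '%s\n' "$1" | awk '{
    file = $0; sub(/:.*/, "", file)        # up to the first ":"
    rest = substr($0, length(file) + 2)    # skip file and its ":"
    line = rest; sub(/:.*/, "", line)      # up to the next ":"
    text = substr(rest, length(line) + 2)  # remainder, untouched
    print file " | " line " | " text
  }'
}

split_record 'foo.c:42:case RED:'
```

Because awk never applies the shell's field-splitting rules here, the trailing colon of the third field should come through unchanged.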