In a Bash script I would like to split a line into pieces and store them in an array.
The line:
Paris, France, Europe
I would like to have them in an array like this:
array[0] = Paris
array[1] = France
array[2] = Europe
I would like to use simple code; the command's speed doesn't matter. How can I do it?
All of the answers to this question are wrong in one way or another.
Wrong answer #1
1: This is a misuse of $IFS. The value of the $IFS variable is not taken as a single variable-length string separator, rather it is taken as a set of single-character string separators, where each field that read splits off from the input line can be terminated by any character in the set (comma or space, in this example).
Actually, for the real sticklers out there, the full meaning of $IFS is slightly more involved. From the bash manual:
Basically, for non-default non-null values of $IFS, fields can be separated with either (1) a sequence of one or more characters that are all from the set of "IFS whitespace characters" (that is, whichever of <space>, <tab>, and <newline> ("newline" meaning line feed (LF)) are present anywhere in $IFS), or (2) any non-"IFS whitespace character" that's present in $IFS along with whatever "IFS whitespace characters" surround it in the input line.
For the OP, it's possible that the second separation mode I described in the previous paragraph is exactly what he wants for his input string, but we can be pretty confident that the first separation mode I described is not correct at all. For example, what if his input string was 'Los Angeles, United States, North America'?
2: Even if you were to use this solution with a single-character separator (such as a comma by itself, that is, with no following space or other baggage), if the value of the $string variable happens to contain any LFs, then read will stop processing once it encounters the first LF. The read builtin only processes one line per invocation. This is true even if you are piping or redirecting input only to the read statement, as we are doing in this example with the here-string mechanism, and thus unprocessed input is guaranteed to be lost. The code that powers the read builtin has no knowledge of the data flow within its containing command structure.
You could argue that this is unlikely to cause a problem, but still, it's a subtle hazard that should be avoided if possible. It is caused by the fact that the read builtin actually does two levels of input splitting: first into lines, then into fields. Since the OP only wants one level of splitting, this usage of the read builtin is not appropriate, and we should avoid it.
3: A non-obvious potential issue with this solution is that read always drops the trailing field if it is empty, although it preserves empty fields otherwise. Here's a demo:
Maybe the OP wouldn't care about this, but it's still a limitation worth knowing about. It reduces the robustness and generality of the solution.
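The three hazards listed above can be reproduced with a minimal sketch. It assumes the approach under discussion has the shape `IFS=', ' read -r -a array <<<"$string"`; the variable names and inputs below are illustrative only.

```bash
# 1: every character in IFS is its own separator, so spaces inside a
#    field split it as well.
string='Los Angeles, United States, North America'
IFS=', ' read -r -a fields <<<"$string"
declare -p fields         # six elements: Los, Angeles, United, States, North, America

# 2: read consumes only the first line, so everything after an LF is lost.
string=$'Paris,France\nEurope'
IFS=',' read -r -a first_line <<<"$string"
declare -p first_line     # two elements: Paris, France

# 3: a trailing empty field is dropped, while interior empty fields are kept.
IFS=',' read -r -a with_empties <<<'a,,b,'
declare -p with_empties   # three elements: a, (empty), b
```

Because each `IFS=...` is a prefix assignment on `read`, the shell's own `$IFS` is left untouched after every call.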
This problem can be solved by appending a dummy trailing delimiter to the input string just prior to feeding it to read, as I will demonstrate later.
Wrong answer #2
Similar idea:
(Note: I added the missing parentheses around the command substitution which the answerer seems to have omitted.)
Similar idea:
These solutions leverage word splitting in an array assignment to split the string into fields. Funnily enough, just like read, general word splitting also uses the $IFS special variable, although in this case it is implied that it is set to its default value of <space><tab><newline>, and therefore any sequence of one or more IFS characters (which are all whitespace characters now) is considered to be a field delimiter.
This solves the problem of two levels of splitting committed by read, since word splitting by itself constitutes only one level of splitting. But just as before, the problem here is that the individual fields in the input string can already contain $IFS characters, and thus they would be improperly split during the word splitting operation. This happens to not be the case for any of the sample input strings provided by these answerers (how convenient...), but of course that doesn't change the fact that any code base that used this idiom would then run the risk of blowing up if this assumption were ever violated at some point down the line. Once again, consider my counterexample of 'Los Angeles, United States, North America' (or 'Los Angeles:United States:North America').
Also, word splitting is normally followed by filename expansion (aka pathname expansion aka globbing), which, if done, would potentially corrupt words containing the characters *, ?, or [ followed by ] (and, if extglob is set, parenthesized fragments preceded by ?, *, +, @, or !) by matching them against file system objects and expanding the words ("globs") accordingly. The first of these three answerers has cleverly undercut this problem by running set -f beforehand to disable globbing. Technically this works (although you should probably add set +f afterward to reenable globbing for subsequent code which may depend on it), but it's undesirable to have to mess with global shell settings in order to hack a basic string-to-array parsing operation in local code.
Another issue with this answer is that all empty fields will be lost. This may or may not be a problem, depending on the application.
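As a rough illustration of the globbing and empty-field problems just described, here is a sketch that assumes a colon-delimited input and the default `$IFS`. The scratch directory and every name in it are hypothetical, used only to make the glob match predictable.

```bash
# Hypothetical input: one field is a bare *, and one field is empty.
string='Los Angeles:*::North America'

# Work in an empty scratch directory so the glob has exactly one match.
scratch=$(mktemp -d)
touch "$scratch/surprise.txt"
cd "$scratch" || exit 1

# Unquoted expansion: colons become spaces, then word splitting and
# filename expansion both run on the result.
naive=(${string//:/ })
declare -p naive     # Los, Angeles, surprise.txt, North, America

# set -f keeps the * literal, but the fields are still split in the
# middle and the empty field is still lost.
set -f
no_glob=(${string//:/ })
set +f
declare -p no_glob   # Los, Angeles, *, North, America

cd / && rm -rf "$scratch"
```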
Note: If you're going to use this solution, it's better to use the ${string//:/ } "pattern substitution" form of parameter expansion, rather than going to the trouble of invoking a command substitution (which forks the shell), starting up a pipeline, and running an external executable (tr or sed), since parameter expansion is purely a shell-internal operation. (Also, for the tr and sed solutions, the input variable should be double-quoted inside the command substitution; otherwise word splitting would take effect in the echo command and potentially mess with the field values. Also, the $(...) form of command substitution is preferable to the old `...` form since it simplifies nesting of command substitutions and allows for better syntax highlighting by text editors.)
Wrong answer #3
This answer is almost the same as #2. The difference is that the answerer has made the assumption that the fields are delimited by two characters, one of which being represented in the default $IFS, and the other not. He has solved this rather specific case by removing the non-IFS-represented character using a pattern substitution expansion and then using word splitting to split the fields on the surviving IFS-represented delimiter character.
This is not a very generic solution. Furthermore, it can be argued that the comma is really the "primary" delimiter character here, and that stripping it and then depending on the space character for field splitting is simply wrong. Once again, consider my counterexample: 'Los Angeles, United States, North America'.
Also, again, filename expansion could corrupt the expanded words, but this can be prevented by temporarily disabling globbing for the assignment with set -f and then set +f.
Also, again, all empty fields will be lost, which may or may not be a problem depending on the application.
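A small sketch of the counterexample, assuming the approach amounts to stripping the commas and then relying on default word splitting. The function name and the global `result` array are made up for illustration.

```bash
# Strip the commas, then let default word splitting separate on spaces.
split_after_stripping_commas() {
    local string=$1
    set -f                    # keep filename expansion away from the words
    result=(${string//,/})    # result is a global, illustrative array
    set +f
}

split_after_stripping_commas 'Paris, France, Europe'
declare -p result   # three elements, as intended

split_after_stripping_commas 'Los Angeles, United States, North America'
declare -p result   # six elements: spaces inside each field split it too
```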
Wrong answer #4
This is similar to #2 and #3 in that it uses word splitting to get the job done, only now the code explicitly sets $IFS to contain only the single-character field delimiter present in the input string. It should be repeated that this cannot work for multicharacter field delimiters such as the OP's comma-space delimiter. But for a single-character delimiter like the LF used in this example, it actually comes close to being perfect. The fields cannot be unintentionally split in the middle as we saw with previous wrong answers, and there is only one level of splitting, as required.
One problem is that filename expansion will corrupt affected words as described earlier, although once again this can be solved by wrapping the critical statement in set -f and set +f.
Another potential problem is that, since LF qualifies as an "IFS whitespace character" as defined earlier, all empty fields will be lost, just as in #2 and #3. This would of course not be a problem if the delimiter happens to be a non-"IFS whitespace character", and depending on the application it may not matter anyway, but it does vitiate the generality of the solution.
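The difference between a whitespace and a non-whitespace delimiter can be seen in a sketch like the following. The inputs and names are illustrative, and saving and restoring `$IFS` by hand is just one way to keep the change local.

```bash
# Illustrative multi-line input with an empty line in the middle.
string=$'first line\n\nthird * line'

old_ifs=$IFS
IFS=$'\n'          # split on LF only
set -f             # disable globbing so the * stays literal
lines=($string)
set +f
IFS=$old_ifs
declare -p lines   # two elements: the empty middle line is dropped

# With a non-whitespace delimiter, interior empty fields survive.
csv='a,,b'
IFS=','
set -f
cells=($csv)
set +f
IFS=$old_ifs
declare -p cells   # three elements: a, (empty), b
```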
So, to sum up, assuming you have a one-character delimiter, and it is either a non-"IFS whitespace character" or you don't care about empty fields, and you wrap the critical statement in set -f and set +f, then this solution works, but otherwise not.
(Also, for information's sake, assigning a LF to a variable in bash can be done more easily with the $'...' syntax, e.g. IFS=$'\n';.)
Wrong answer #5
Similar idea:
This solution is effectively a cross between #1 (in that it sets $IFS to comma-space) and #2-4 (in that it uses word splitting to split the string into fields). Because of this, it suffers from most of the problems that afflict all of the above wrong answers, sort of like the worst of all worlds.
Also, regarding the second variant, it may seem like the eval call is completely unnecessary, since its argument is a single-quoted string literal, and therefore is statically known. But there's actually a very non-obvious benefit to using eval in this way. Normally, when you run a simple command which consists of a variable assignment only, meaning without an actual command word following it, the assignment takes effect in the shell environment:
This is true even if the simple command involves multiple variable assignments; again, as long as there's no command word, all variable assignments affect the shell environment:
But, if the variable assignment is attached to a command name (I like to call this a "prefix assignment") then it does not affect the shell environment, and instead only affects the environment of the executed command, regardless of whether it is a builtin or external:
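The three cases described above (a bare assignment, several bare assignments, and a prefix assignment) can be sketched as follows. The variable names are illustrative, and the child `bash -c` process is only there to show what the executed command sees.

```bash
unset var1 var2

# Assignment-only simple commands change the current shell environment.
var1=a
var1=b var2=c
echo "after bare assignments: var1=$var1 var2=$var2"

# A prefix assignment is visible to the command it is attached to...
seen_by_child=$(var1=temporary bash -c 'printf %s "$var1"')
echo "child saw: $seen_by_child"

# ...but it does not persist in the current shell afterward.
echo "after prefix assignment: var1=$var1"
```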
Relevant quote from the bash manual:
It is possible to exploit this feature of variable assignment to change $IFS only temporarily, which allows us to avoid the whole save-and-restore gambit like that which is being done with the $OIFS variable in the first variant. But the challenge we face here is that the command we need to run is itself a mere variable assignment, and hence it would not involve a command word to make the $IFS assignment temporary. You might think to yourself, well why not just add a no-op command word to the statement like the : builtin to make the $IFS assignment temporary? This does not work because it would then make the $array assignment temporary as well:
So, we're effectively at an impasse, a bit of a catch-22. But, when eval runs its code, it runs it in the shell environment, as if it was normal, static source code, and therefore we can run the $array assignment inside the eval argument to have it take effect in the shell environment, while the $IFS prefix assignment that is prefixed to the eval command will not outlive the eval command. This is exactly the trick that is being used in the second variant of this solution:
So, as you can see, it's actually quite a clever trick, and accomplishes exactly what is required (at least with respect to where the assignments take effect) in a rather non-obvious way. I'm actually not against this trick in general, despite the involvement of eval; just be careful to single-quote the argument string to guard against security threats.
But again, because of the "worst of all worlds" agglomeration of problems, this is still a wrong answer to the OP's requirement.
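For reference, the prefix-assignment-plus-eval trick can be sketched like this. It assumes bash in its default (non-POSIX) mode, where prefix assignments do not outlive special builtins such as : and eval. The names `marker`, `array`, and `saved_ifs` are illustrative.

```bash
string='Paris, France, Europe'
saved_ifs=$IFS

# A no-op command word makes every prefix assignment temporary, not just IFS.
unset marker
IFS=', ' marker=assigned :
echo "marker after ':' -> ${marker-<unset>}"

# Prefixing eval keeps IFS temporary, while the array assignment inside the
# single-quoted argument runs in the current shell environment.
IFS=', ' eval 'array=($string)'
declare -p array

[[ $IFS == "$saved_ifs" ]] && echo 'IFS is back to its previous value'
```

The single quotes matter: `$string` is expanded by eval itself, after the temporary `$IFS` is in place, and its contents are split as data rather than re-parsed as code.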
Wrong answer #6
Um... what? The OP has a string variable that needs to be parsed into an array. This "answer" starts with the verbatim contents of the input string pasted into an array literal. I guess that's one way to do it.
It looks like the answerer may have assumed that the $IFS variable affects all bash parsing in all contexts, which is not true. From the bash manual:
So the $IFS special variable is actually only used in two contexts: (1) word splitting that is performed after expansion (meaning not when parsing bash source code) and (2) for splitting input lines into words by the read builtin.
Let me try to make this clearer. I think it might be good to draw a distinction between parsing and execution. Bash must first parse the source code, which obviously is a parsing event, and then later it executes the code, which is when expansion comes into the picture. Expansion is really an execution event. Furthermore, I take issue with the description of the $IFS variable that I just quoted above; rather than saying that word splitting is performed after expansion, I would say that word splitting is performed during expansion, or, perhaps even more precisely, word splitting is part of the expansion process. The phrase "word splitting" refers only to this step of expansion; it should never be used to refer to the parsing of bash source code, although unfortunately the docs do seem to throw around the words "split" and "words" a lot. Here's a relevant excerpt from the linux.die.net version of the bash manual:
You could argue the GNU version of the manual does slightly better, since it opts for the word "tokens" instead of "words" in the first sentence of the Expansion section:
The important point is, $IFS does not change the way bash parses source code. Parsing of bash source code is actually a very complex process that involves recognition of the various elements of shell grammar, such as command sequences, command lists, pipelines, parameter expansions, arithmetic substitutions, and command substitutions. For the most part, the bash parsing process cannot be altered by user-level actions like variable assignments (actually, there are some minor exceptions to this rule; for example, see the various compatxx shell settings, which can change certain aspects of parsing behavior on-the-fly). The upstream "words"/"tokens" that result from this complex parsing process are then expanded according to the general process of "expansion" as broken down in the above documentation excerpts, where word splitting of the expanded (expanding?) text into downstream words is simply one step of that process. Word splitting only touches text that has been spit out of a preceding expansion step; it does not affect literal text that was parsed right off the source bytestream.
Wrong answer #7
This is one of the best solutions. Notice that we're back to using read. Didn't I say earlier that read is inappropriate because it performs two levels of splitting, when we only need one? The trick here is that you can call read in such a way that it effectively only does one level of splitting, specifically by splitting off only one field per invocation, which necessitates the cost of having to call it repeatedly in a loop. It's a bit of a sleight of hand, but it works.
But there are problems. First: When you provide at least one NAME argument to read, it automatically ignores leading and trailing whitespace in each field that is split off from the input string. This occurs whether $IFS is set to its default value or not, as described earlier in this post. Now, the OP may not care about this for his specific use-case, and in fact, it may be a desirable feature of the parsing behavior. But not everyone who wants to parse a string into fields will want this. There is a solution, however: A somewhat non-obvious usage of read is to pass zero NAME arguments. In this case, read will store the entire input line that it gets from the input stream in a variable named $REPLY, and, as a bonus, it does not strip leading and trailing whitespace from the value. This is a very robust usage of read which I've exploited frequently in my shell programming career. Here's a demonstration of the difference in behavior:
The second issue with this solution is that it does not actually address the case of a custom field separator, such as the OP's comma-space. As before, multicharacter separators are not supported, which is an unfortunate limitation of this solution. We could try to at least split on comma by specifying the separator to the -d option, but look what happens:
Predictably, the unaccounted surrounding whitespace got pulled into the field values, and hence this would have to be corrected subsequently through trimming operations (this could also be done directly in the while-loop). But there's another obvious error: Europe is missing! What happened to it? The answer is that read returns a failing return code if it hits end-of-file (in this case we can call it end-of-string) without encountering a final field terminator on the final field. This causes the while-loop to break prematurely and we lose the final field.
Technically this same error afflicted the previous examples as well; the difference there is that the field separator was taken to be LF, which is the default when you don't specify the -d option, and the <<< ("here-string") mechanism automatically appends a LF to the string just before it feeds it as input to the command. Hence, in those cases, we sort of accidentally solved the problem of a dropped final field by unwittingly appending an additional dummy terminator to the input. Let's call this solution the "dummy-terminator" solution. We can apply the dummy-terminator solution manually for any custom delimiter by concatenating it against the input string ourselves when instantiating it in the here-string:
There, problem solved. Another solution is to only break the while-loop if both (1) read returned failure and (2) $REPLY is empty, meaning read was not able to read any characters prior to hitting end-of-file. Demo:
This approach also reveals the secretive LF that automatically gets appended to the here-string by the <<< redirection operator. It could of course be stripped off separately through an explicit trimming operation as described a moment ago, but obviously the manual dummy-terminator approach solves it directly, so we could just go with that. The manual dummy-terminator solution is actually quite convenient in that it solves both of these two problems (the dropped-final-field problem and the appended-LF problem) in one go.
So, overall, this is quite a powerful solution. Its only remaining weakness is a lack of support for multicharacter delimiters, which I will address later.
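The behaviors walked through above can be reproduced with a sketch along these lines. The loop shapes and all names are assumptions made for illustration, and the default `$IFS` is assumed for the first comparison.

```bash
# With a NAME argument, surrounding whitespace is stripped; with no NAME,
# the raw line lands in REPLY.
padded=$'  a  b  \n  c  d  '
named=(); while read -r line; do named+=("$line"); done <<<"$padded"
raw=();   while read -r;      do raw+=("$REPLY");  done <<<"$padded"
declare -p named raw

string='Paris, France, Europe'

# Splitting on comma with -d loses the unterminated final field.
lossy=(); while read -rd,; do lossy+=("$REPLY"); done <<<"$string"
declare -p lossy   # Paris and " France" only

# Dummy terminator: append the delimiter so the last field is terminated.
fixed=(); while read -rd,; do fixed+=("$REPLY"); done <<<"$string,"
declare -p fixed   # Paris, " France", " Europe"

# Alternative: stop only when read fails AND nothing was read.
alt=(); while read -rd, || [[ -n $REPLY ]]; do alt+=("$REPLY"); done <<<"$string"
declare -p alt     # the last element carries the LF appended by <<<
```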
Wrong answer #8
(This is actually from the same post as #7; the answerer provided two solutions in the same post.)
The readarray builtin, which is a synonym for mapfile, is ideal. It's a builtin command which parses a bytestream into an array variable in one shot; no messing with loops, conditionals, substitutions, or anything else. And it doesn't surreptitiously strip any whitespace from the input string. And (if -O is not given) it conveniently clears the target array before assigning to it. But it's still not perfect, hence my criticism of it as a "wrong answer".
First, just to get this out of the way, note that, just like the behavior of read when doing field-parsing, readarray drops the trailing field if it is empty. Again, this is probably not a concern for the OP, but it could be for some use-cases. I'll come back to this in a moment.
Second, as before, it does not support multicharacter delimiters. I'll give a fix for this in a moment as well.
Third, the solution as written does not parse the OP's input string, and in fact, it cannot be used as-is to parse it. I'll expand on this momentarily as well.
For the above reasons, I still consider this to be a "wrong answer" to the OP's question. Below I'll give what I consider to be the right answer.
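To make the first and third points concrete, here is a sketch that assumes the approach has the form `readarray -t` reading from redirected input. The array names are illustrative.

```bash
# The OP's string contains no LF, so readarray sees exactly one "line".
string='Paris, France, Europe'
readarray -t fields <<<"$string"
declare -p fields     # one element holding the whole string

# The final LF terminates the last line; it does not create an empty
# trailing element.
readarray -t terminated < <(printf 'a\nb\n')
declare -p terminated # two elements: a, b
```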
Right answer
Here's a naïve attempt to make #8 work by just specifying the -d option:
We see the result is identical to the result we got from the double-conditional approach of the looping read solution discussed in #7. We can almost solve this with the manual dummy-terminator trick:
The problem here is that readarray preserved the trailing field, since the <<< redirection operator appended the LF to the input string, and therefore the trailing field was not empty (otherwise it would've been dropped). We can take care of this by explicitly unsetting the final array element after-the-fact:
The only two problems that remain, which are actually related, are (1) the extraneous whitespace that needs to be trimmed, and (2) the lack of support for multicharacter delimiters.
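The three steps above might look roughly like this. The sketch assumes bash 4.4 or later (for `readarray -d`), and the array names are illustrative.

```bash
string='Paris, France, Europe'

# Naive: the LF appended by <<< ends up inside the last field.
readarray -td, naive <<<"$string"
declare -p naive    # Paris, " France", $' Europe\n'

# Dummy terminator: the LF now forms its own trailing field...
readarray -td, padded <<<"$string,"
declare -p padded   # four elements, the last one is just $'\n'

# ...which can be removed afterward.
unset 'padded[-1]'
declare -p padded   # Paris, " France", " Europe"
```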
The whitespace could of course be trimmed afterward (for example, see How to trim whitespace from a Bash variable?). But if we can hack a multicharacter delimiter, then that would solve both problems in one shot.
Unfortunately, there's no direct way to get a multicharacter delimiter to work. The best solution I've thought of is to preprocess the input string to replace the multicharacter delimiter with a single-character delimiter that will be guaranteed not to collide with the contents of the input string. The only character that has this guarantee is the NUL byte. This is because, in bash (though not in zsh, incidentally), variables cannot contain the NUL byte. This preprocessing step can be done inline in a process substitution. Here's how to do it using awk:
There, finally! This solution will not erroneously split fields in the middle, will not cut out prematurely, will not drop empty fields, will not corrupt itself on filename expansions, will not automatically strip leading and trailing whitespace, will not leave a stowaway LF on the end, does not require loops, and does not settle for a single-character delimiter.
Trimming solution
Lastly, I wanted to demonstrate my own fairly intricate trimming solution using the obscure -C callback option of readarray. Unfortunately, I've run out of room against Stack Overflow's draconian 30,000 character post limit, so I won't be able to explain it. I'll leave that as an exercise for the reader.
Here's my hack!
Splitting strings by strings is a pretty boring thing to do using bash. What happens is that we have limited approaches that only work in a few cases (split by ";", "/", "." and so on) or we have a variety of side effects in the outputs.
The approach below has required a number of maneuvers, but I believe it will work for most of our needs!
Note that the characters in $IFS are treated individually as separators so that in this case fields may be separated by either a comma or a space rather than the sequence of the two characters. Interestingly though, empty fields aren't created when comma-space appears in the input because the space is treated specially.
To access an individual element:
To iterate over the elements:
To get both the index and the value:
The last example is useful because Bash arrays are sparse. In other words, you can delete an element or add an element and then the indices are not contiguous.
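The three operations above, and the sparseness remark, can be sketched as follows. The array contents are illustrative, and index 1 is removed on purpose to leave a gap.

```bash
array=(Paris France Europe)
unset 'array[1]'          # the array is now sparse: indices 0 and 2

# Access an individual element.
echo "${array[0]}"

# Iterate over the elements.
for element in "${array[@]}"; do
    echo "$element"
done

# Get both the index and the value.
for index in "${!array[@]}"; do
    echo "$index ${array[index]}"
done
```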
To get the number of elements in an array:
As mentioned above, arrays can be sparse so you shouldn't use the length to get the last element. Here's how you can in Bash 4.2 and later:
Or, in any version of Bash (from somewhere after 2.05b):
Larger negative offsets select farther from the end of the array. Note the space before the minus sign in the older form. It is required.
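A short sketch of counting elements and fetching the last one in both forms, using a deliberately sparse, illustrative array:

```bash
# Illustrative sparse array with indices 0, 2 and 7.
array=([0]=Paris [2]=France [7]=Europe)

# Number of elements: 3 here, even though the highest index is 7.
count=${#array[@]}
echo "$count"

# Last element via a negative subscript (the newer form).
last_new=${array[-1]}

# Last element via slice syntax. The space before the minus sign is
# required; without it, ":-" would be parsed as the default-value operator.
last_old="${array[@]: -1:1}"

echo "$last_new $last_old"
```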
This is similar to the approach by Jmoney38, but using sed:
Prints 1
The accepted answer works for values in one line.
If the variable has several lines:
We need a very different command to get all lines:
while read -r line; do lines+=("$line"); done <<<"$string"
Or the much simpler bash readarray:
Printing all lines is very easy taking advantage of a printf feature:
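A sketch of the readarray variant together with the printf feature in question. The multi-line value and the `>[...]` format are illustrative choices.

```bash
string=$'Paris, France, Europe\nLos Angeles, United States, North America'

# Read every line into its own array element; -t strips the trailing LF.
readarray -t lines <<<"$string"

# printf reuses its format string for each remaining argument, so a single
# call prints every element on its own line.
printf '>[%s]\n' "${lines[@]}"
```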
UPDATE: Don't do this, due to problems with eval.
With slightly less ceremony:
e.g.