Split string into array in bash

Posted 2020-07-09 06:19

Question:

I am looking for a way to split a string in bash over a delimiter string, and place the parts in an array.

Simple case:

#!/bin/bash
b="aaaaa/bbbbb/ddd/ffffff"
echo "simple string: $b"

IFS='/' b_split=($b)
echo ;
echo "split"
for i in ${b_split[@]}
do
    echo "------ new part ------"
    echo "$i"
done

Gives output

simple string: aaaaa/bbbbb/ddd/ffffff

split
------ new part ------
aaaaa
------ new part ------
bbbbb
------ new part ------
ddd
------ new part ------
ffffff

More complex case:

#!/bin/bash
c=$(echo "AA=A"; echo "B=BB"; echo "======="; echo "C==CC"; echo "DD=D"; echo "======="; echo "EEE"; echo "FF";)
echo "more complex string"
echo "$c";
echo ;
echo "split";

IFS='=======' c_split=($c) ;#    <----    LINE TO BE CHANGED 

for i in ${c_split[@]}
do
    echo "------ new part ------"
    echo "$i"
done

Gives output:

more complex string
AA=A
B=BB
=======
C==CC
DD=D
=======
EEE
FF

split
------ new part ------
AA
------ new part ------
A
B
------ new part ------
BB

------ new part ------

------ new part ------

------ new part ------

------ new part ------

------ new part ------

------ new part ------

------ new part ------

C
------ new part ------

------ new part ------
CC
DD
------ new part ------
D

------ new part ------

------ new part ------

------ new part ------

------ new part ------

------ new part ------

------ new part ------

------ new part ------

EEE
FF

I would like the second output to be like

------ new part ------
AA=A
B=BB
------ new part ------
C==CC
DD=D
------ new part ------
EEE
FF

I.e. to split the string on a sequence of characters, instead of one. How can I do this?

I am looking for an answer that would only modify this line in the second script:

IFS='=======' c_split=($c) ;#    <----    LINE TO BE CHANGED 

Answer 1:

IFS disambiguation

IFS is the Internal Field Separator: a list of characters, each of which can act as a field separator.

By default it is set to space, tab and newline ($' \t\n'), meaning that any run of one or more spaces, tabs and/or newlines counts as a single separator.

So in the string:

 "    blah  foo=bar 
 baz  "

leading and trailing separators are ignored, and the string contains only 3 parts: blah, foo=bar and baz.
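
To make the two points above concrete, here is a small sketch (the variable names parts, x and eq_parts are mine, not from the question). It first splits the sample string with the default IFS, then shows that a multi-character IFS such as '=======' is treated as a set of characters, so it behaves like a plain '=':

```bash
# Default IFS: runs of space, tab and newline act as one separator,
# and leading/trailing separators are dropped.
IFS=$' \t\n'
s=$'    blah  foo=bar \n baz  '
parts=($s)
printf '<%s>\n' "${parts[@]}"

# IFS is a set of characters, not a delimiter string:
# '=======' is the same as '=' and splits on every single '='.
x='C==CC'
old_ifs=$IFS
IFS='======='
eq_parts=($x)
IFS=$old_ifs
printf '%d fields\n' "${#eq_parts[@]}"   # 3 fields: "C", "" and "CC"
```

This is why the second script in the question produces so many empty parts: every '=' is its own separator.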

Splitting a string using IFS is possible if you know a valid field separator not used in your string.

OIFS="$IFS"
IFS='§'
c=$'AA=A\nB=BB\n=======\nC==CC\nDD=D\n=======\nEEE\nFF'
c_split=(${c//=======/§})
IFS="$OIFS"
printf -- "------ new part ------\n%s\n" "${c_split[@]}"

------ new part ------
AA=A
B=BB

------ new part ------

C==CC
DD=D

------ new part ------

EEE
FF

But this works only as long as the string does not contain §.

You could use another character, like IFS=$'\026';c_split=(${c//=======/$'\026'}), but this may still lead to further bugs.

You could scan the character map to find one that is not in your string:

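myIfs=""
for i in {1..255};do
    # Build the character with code $i from its octal escape
    printf -v char "$(printf "\\\%03o" $i)"
        # Quote $char so that *, ? and [ are matched literally
        [ "$c" == "${c#*"$char"}" ] && myIfs="$char" && break
  done
if ! [ "$myIfs" ] ;then
    echo no split char found, could not do the job, sorry.
    exit 1
  fi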

but I find this solution a little overkill.

Splitting on spaces (or without modifying IFS)

Under bash, we could use this bashism:

b="aaaaa/bbbbb/ddd/ffffff"
b_split=(${b//// })

In fact, the syntax ${varname//pattern/replacement} performs a substitution (delimited by /) that replaces all occurrences of / by a space, before the result is assigned to the array b_split.

Of course, this still uses IFS and splits the array on spaces.

This is not the best way, but it can work in specific cases.

You could even drop unwanted spaces before splitting:

b='12 34 / 1 3 5 7 / ab'
b1=${b// }
b_split=(${b1//// })
printf "<%s>, " "${b_split[@]}" ;echo
<12>, <34>, <1>, <3>, <5>, <7>, <ab>, 

or exchange them...

b1=${b// /§}
b_split=(${b1//// })
printf "<%s>, " "${b_split[@]//§/ }" ;echo
<12 34 >, < 1 3 5 7 >, < ab>, 
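
The substitutions above still rely on an unquoted expansion, so glob characters in the data (such as *) would be expanded against file names. For the single-character case, a different technique avoids this: read -a with IFS set only for that one command. This is a sketch of that alternative, not part of the approach above, and it only handles one line of input:

```bash
b='12 34 / 1 3 5 7 / ab*'
# IFS=/ applies to this read only; the shell's own IFS is left untouched.
# -r keeps backslashes literal, -a stores the fields in an array.
IFS=/ read -r -a b_split <<< "$b"
printf '<%s>, ' "${b_split[@]}"; echo
# prints: <12 34 >, < 1 3 5 7 >, < ab*>, 
```

Spaces inside the fields are kept because / is the only separator, and the * is never treated as a glob.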

Splitting line on strings:

So IFS is not the right tool for what you want, but bash does have nice features:

#!/bin/bash

c=$'AA=A\nB=BB\n=======\nC==CC\nDD=D\n=======\nEEE\nFF'
echo "more complex string"
echo "$c";
echo ;
echo "split";

mySep='======='
while [ "$c" != "${c#*$mySep}" ];do
    echo "------ new part ------"
    echo "${c%%$mySep*}"
    c="${c#*$mySep}"
  done
echo "------ last part ------"
echo "$c"

Let's see:

more complex string
AA=A
B=BB
=======
C==CC
DD=D
=======
EEE
FF

split
------ new part ------
AA=A
B=BB

------ new part ------

C==CC
DD=D

------ last part ------

EEE
FF

Note: leading and trailing newlines are not deleted. If this is needed, you could use:

mySep=$'\n=======\n'

instead of simply =======.

Or you could rewrite the split loop to strip them explicitly:

mySep=$'======='
while [ "$c" != "${c#*$mySep}" ];do
    echo "------ new part ------"
    part="${c%%$mySep*}"
    part="${part##$'\n'}"
    echo "${part%%$'\n'}"
    c="${c#*$mySep}"
  done
echo "------ last part ------"
c=${c##$'\n'}
echo "${c%%$'\n'}"

In any case, this matches what the SO question asked for (and its sample):

------ new part ------
AA=A
B=BB
------ new part ------
C==CC
DD=D
------ last part ------
EEE
FF

Finally, creating an array

#!/bin/bash
c=$'AA=A\nB=BB\n=======\nC==CC\nDD=D\n=======\nEEE\nFF'
echo "more complex string"
echo "$c";
echo ;
echo "split";

mySep=$'======='
export -a c_split
while [ "$c" != "${c#*$mySep}" ];do
    part="${c%%$mySep*}"
    part="${part##$'\n'}"
    c_split+=("${part%%$'\n'}")
    c="${c#*$mySep}"
  done
c=${c##$'\n'}
c_split+=("${c%%$'\n'}")

for i in "${c_split[@]}"
do
    echo "------ new part ------"
    echo "$i"
done

This works fine:

more complex string
AA=A
B=BB
=======
C==CC
DD=D
=======
EEE
FF

split
------ new part ------
AA=A
B=BB
------ new part ------
C==CC
DD=D
------ new part ------
EEE
FF

Some explanations:

  • export -a var declares var as an array. (Bash does not actually pass arrays on to child processes; declare -a var is the more usual form.)
  • ${variablename%string*} and ${variablename%%string*} return the left part of variablename, up to but not including string. One % cuts at the last occurrence of string (shortest match), %% at the first occurrence (longest match). The full variablename is returned if string is not found.
  • ${variablename#*string} and ${variablename##*string} do the same in the reverse direction: they return the right part of variablename, after string. One # cuts at the first occurrence (shortest match), ## at the last occurrence (longest match).

Note: in these patterns, the character * is a wildcard that matches any number of any characters.

The command echo "${c%%$'\n'}" echoes variable c without its trailing newline (the pattern has no wildcard, so at most one newline is removed).

So if variable contains Hello WorldZorGluBHello youZorGluBI'm happy,

variable="Hello WorldZorGluBHello youZorGluBI'm happy"

$ echo ${variable#*ZorGluB}
Hello youZorGluBI'm happy

$ echo ${variable##*ZorGluB}
I'm happy

$ echo ${variable%ZorGluB*}
Hello WorldZorGluBHello you

$ echo ${variable%%ZorGluB*}
Hello World

$ echo ${variable%%ZorGluB}
Hello WorldZorGluBHello youZorGluBI'm happy

$ echo ${variable%happy}
Hello WorldZorGluBHello youZorGluBI'm

$ echo ${variable##* }
happy

All this is explained in the manpage:

$ man -Len -Pless\ +/##word bash

$ man -Len -Pless\ +/%%word bash

$ man -Len -Pless\ +/^\\\ *export\\\ .*word bash
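
One detail worth checking by hand: ${part##$'\n'} and ${part%%$'\n'} contain no wildcard, so each strips at most one newline. If every leading and trailing newline should go, a possible sketch uses an extglob pattern (the names s, nl, one and all are only for illustration, and shopt -s extglob must be enabled before the pattern is used):

```bash
shopt -s extglob          # enables the +(...) pattern used below
nl=$'\n'
s=$'\n\nAA=A\nB=BB\n\n'

# Without a wildcard, only a single trailing newline is removed.
one="${s%%$nl}"

# +($nl) matches one or more newlines, so whole runs are removed.
all="${s%%+($nl)}"
all="${all##+($nl)}"

printf '[%s]\n' "$all"
```

For the sample data in the question a single newline on each side is all there is, so the simpler form used in the loop is enough.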

Step by step, the splitting loop:

The separator:

mySep=$'======='

Declaring c_split as an array (export is meant to share it with child processes, although bash does not actually export arrays)

export -a c_split

While variable c contains at least one occurrence of mySep

while [ "$c" != "${c#*$mySep}" ];do

Truncate c from the first mySep to the end of the string and assign the result to part.

    part="${c%%$mySep*}"

Remove a leading newline

    part="${part##$'\n'}"

Remove a trailing newline and add the result as a new array element to c_split.

    c_split+=("${part%%$'\n'}")

Reassign c with the rest of the string, once the left part up to and including mySep is removed

    c="${c#*$mySep}"

Done ;-)

done

Remove a leading newline

c=${c##$'\n'}

Remove a trailing newline and add the result as a new array element to c_split.

c_split+=("${c%%$'\n'}")

Into a function:

ssplit() {
    local string="$1" array=${2:-ssplited_array} delim="${3:- }" pos=0
    while [ "$string" != "${string#*$delim}" ];do
        printf -v $array[pos++] "%s" "${string%%$delim*}"
        string="${string#*$delim}"
      done
    printf -v $array[pos] "%s" "$string"
}

Usage:

ssplit "<quoted string>" [array name] [delimiter string]

where array name is ssplited_array by default and delimiter is one single space.

You could use:

c=$'AA=A\nB=BB\n=======\nC==CC\nDD=D\n=======\nEEE\nFF'
ssplit "$c" c_split $'\n=======\n'
printf -- "--- part ----\n%s\n" "${c_split[@]}"
--- part ----
AA=A
B=BB
--- part ----
C==CC
DD=D
--- part ----
EEE
FF
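
Two things to keep in mind with ssplit: $delim is expanded unquoted inside the patterns, so a delimiter containing *, ? or [ would act as a glob, and the target array is not emptied first. A possible variant is sketched below. The name split_on and its argument order are made up for this example, and it assumes bash 4.3 or later for local -n (a nameref):

```bash
# Hypothetical variant of ssplit: split_on <string> <delimiter> <array name>
split_on() {
    local string=$1 delim=$2
    [[ -n $delim ]] || return 1       # an empty delimiter would loop forever
    local -n _out=$3                  # nameref to the caller's array
    _out=()                           # start from an empty array
    # Quoting "$delim" makes it match literally, never as a glob.
    while [[ $string == *"$delim"* ]]; do
        _out+=("${string%%"$delim"*}")
        string=${string#*"$delim"}
    done
    _out+=("$string")
}

c=$'AA=A\nB=BB\n=======\nC==CC\nDD=D\n=======\nEEE\nFF'
split_on "$c" $'\n=======\n' c_split
printf -- '--- part ----\n%s\n' "${c_split[@]}"
```

The caller's array must not itself be called _out, since the nameref would then refer to itself.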


Answer 2:

Do it with awk (a regular-expression record separator needs an awk that supports it, such as GNU awk):

 awk -vRS='\n=*\n'  '{print "----- new part -----";print}' <<< "$c"

output:

kent$  awk -vRS='\n=*\n'  '{print "----- new part -----";print}' <<< "$c"
----- new part -----
AA=A
B=BB
----- new part -----
C==CC
DD=D
----- new part -----
EEE
FF


Answer 3:

The following script was tested in bash:

kent@7pLaptop:/tmp/test$ bash --version
GNU bash, version 4.2.42(2)-release (i686-pc-linux-gnu)

the script: (named t.sh)

#!/bin/bash

c=$(echo "AA=A"; echo "B=BB"; echo "======="; echo "C==CC"; echo "DD=D"; echo "======="; echo "EEE"; echo "FF";)
echo "more complex string"
echo "$c"
echo "split now"

c_split=($(echo "$c"|awk -vRS="\n=*\n"  '{gsub(/\n/,"\\n");printf $0" "}'))

for i in ${c_split[@]}
do
    echo "---- new part ----"
    echo -e "$i" 
done

output:

kent@7pLaptop:/tmp/test$ ./t.sh 
more complex string
AA=A
B=BB
=======
C==CC
DD=D
=======
EEE
FF
split now
---- new part ----
AA=A
B=BB
---- new part ----
C==CC
DD=D
---- new part ----
EEE
FF

Note the echo statement in that for loop: if you remove the option -e you will see:

---- new part ----
AA=A\nB=BB
---- new part ----
C==CC\nDD=D
---- new part ----
EEE\nFF\n

Whether to use -e or not depends on your requirements.



Answer 4:

Here's an approach that doesn't fumble when the data contains literal backslash sequences, spaces and the like:

c=$(echo "AA=A"; echo "B=BB"; echo "======="; echo "C==CC"; echo "DD=D"; echo "======="; echo "EEE"; echo "FF";)
echo "more complex string"
echo "$c";
echo ;
echo "split";

c_split=()
while IFS= read -r -d '' part
do
  c_split+=( "$part" )
done < <(printf "%s" "$c" | sed -e 's/=======/\x00/g')
c_split+=( "$part" )

for i in "${c_split[@]}"
do
    echo "------ new part ------"
    echo "$i"
done

Note that the string is actually split on "=======" as requested, so the line feeds become part of the data (causing extra blank lines when "echo" adds its own).
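
If those extra line feeds are not wanted, they can be trimmed after the split. The sketch below makes two assumptions: bash 4.4 or later (for mapfile -d, used here in place of the while read loop) and a sed that understands \x00 in the replacement, as GNU sed does. The helper names nl and part are only for illustration:

```bash
c=$'AA=A\nB=BB\n=======\nC==CC\nDD=D\n=======\nEEE\nFF'

# Read NUL-delimited records straight into the array; -t drops the NULs.
mapfile -d '' -t c_split < <(printf '%s' "$c" | sed -e 's/=======/\x00/g')

# Strip one leading and one trailing newline from every part.
nl=$'\n'
for i in "${!c_split[@]}"; do
    part=${c_split[i]}
    part=${part#"$nl"}
    c_split[i]=${part%"$nl"}
done

for i in "${c_split[@]}"; do
    echo "------ new part ------"
    echo "$i"
done
```

Because the parts travel through the pipe NUL-delimited and are always expanded quoted, spaces, backslashes and glob characters in the data are left alone.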



Answer 5:

I added some lines to the example text because of this comment:

This breaks if you replace AA=A with AA =A or with AA=\nA – that other guy

EDIT: I added a suggestion that is not sensitive to any particular delimiter in the text. However, it does not use the "one line split" that the OP was asking for; it is how I would do it in bash if I wanted the result in an array.

script.sh (NEW):

#!/bin/bash

text=$(
  echo "AA=A"; echo "AA =A"; echo "AA=\nA"; echo "B=BB"; echo "=======";
  echo "C==CC"; echo "DD=D"; echo "======="; echo "EEE"; echo "FF";
)
echo "more complex string"
echo "$text"
echo "split now"

c_split[0]=""
current=""
del=""
ind=0

# newline
newl=$'\n'

# Save IFS (not necessary when run as sub shell)
saveIFS="$IFS"
IFS="$newl"
for row in $text; do

  if [[ $row =~ ^=+$ ]]; then
    c_split[$ind]="$current"
    ((ind++))
    current=""
    # Avoid preceding newline
    del=""
    continue
  fi

  current+="$del$row"
  del="$newl"
done

# Restore IFS
IFS="$saveIFS"

# If there is a last poor part of the text
if [[ -n $current ]]; then
  c_split[$ind]="$current"
fi

# The result is an array
for i in "${c_split[@]}"
do
    echo "---- new part ----"
    echo "$i"
done

script.sh (OLD, with "one line split"):
(I stole the idea with awk from @Kent and adjusted it a bit)

#!/bin/bash

c=$(
  echo "AA=A"; echo "AA =A"; echo "AA=\nA"; echo "B=BB"; echo "=======";
  echo "C==CC"; echo "DD=D"; echo "======="; echo "EEE"; echo "FF";
)
echo "more complex string"
echo "$c"
echo "split now"

# Now, this will be almost absolute secure,
# perhaps except a direct hit by lightning.
del=""
for ch in $'\1' $'\2' $'\3' $'\4' $'\5' $'\6' $'\7'; do
  if [ -z "`echo "$c" | grep "$ch"`" ]; then
    del="$ch"
    break
  fi
done

if [ -z "$del" ]; then
  echo "Sorry, all this testing but no delimiter to use..."
  exit 1
fi

IFS="$del" c_split=($(echo "$c" | awk -vRS="\n=+\n" -vORS="$del" '1'))

for i in ${c_split[@]}
do
  echo "---- new part ----"
  echo "$i"
done

Output:

[244an]$ bash --version
GNU bash, version 4.2.24(1)-release (x86_64-pc-linux-gnu)

[244an]$ ./script.sh
more complex string
AA=A
AA =A
AA=\nA
B=BB
=======
C==CC
DD=D
=======
EEE
FF
split now
---- new part ----
AA=A
AA =A
AA=\nA
B=BB
---- new part ----
C==CC
DD=D
---- new part ----
EEE
FF

I'm not using -e with echo, so that AA=\nA is printed literally instead of producing a newline.