I am trying to retrieve specific fields from a text file whose metadata looks as follows:
project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN
And I have the following script for retrieving the field 'cell'
while read line
do
cell="$(echo $line | cut -d";" -f7 )"
echo $cell
done < files.txt
However, the script above retrieves the whole field, cell=ABC, whereas I just want the value 'ABC' from the field. How do I retrieve the value after the regex, in the same line of code?
If extracting one value (or, generally, a non-repeating set of values captured by distinct capture groups) is enough and you're running bash, ksh, or zsh, consider using the regex-matching operator, =~: [[ string =~ regex ]].
Tip of the hat to @Adrian Frühwirth for the gist of the ksh and zsh solutions.
Sample input string:
string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
Shell-specific use of =~ is discussed next; a multi-shell implementation of the =~ functionality via a shell function can be found at the end.
bash
The special BASH_REMATCH array variable receives the results of the matching operation: element 0 contains the entire match, element 1 the first capture group's (parenthesized subexpression's) match, and so on.
bash 3.2+:
[[ $string =~ \ cell=([^;]+) ]] && cell=${BASH_REMATCH[1]} # -> $cell == 'ABC'
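To illustrate how the elements of BASH_REMATCH are filled, here is a minimal sketch with two capture groups. It assumes bash 3.2 or later; the names re, whole, key, and value are only illustrative and not part of the original command.

```shell
string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'

# Keep the regex in a variable and reference it unquoted,
# so that bash interprets it as a regex.
re='(cell)=([^;]+)'

if [[ $string =~ $re ]]; then
  whole=${BASH_REMATCH[0]}   # entire match
  key=${BASH_REMATCH[1]}     # 1st capture group
  value=${BASH_REMATCH[2]}   # 2nd capture group
  echo "$whole | $key | $value"   # prints: cell=ABC | cell | ABC
fi
```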
bash 4.x:
While the specific command above works, using regex literals in bash 4.x is buggy, notably when involving word-boundary assertions \< and \> on Linux; e.g., [[ a =~ \<a ]] inexplicably doesn't match. Workaround: use an intermediate variable (unquoted!): re='\<a'; [[ a =~ $re ]] works (also on bash 3.2+).
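Why the variable has to be referenced unquoted can be seen in the following sketch, which assumes bash 3.2 or later without compat31 set. The variable names unquoted and quoted are illustrative only.

```shell
string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
re=' cell=([^;]+)'

# Unquoted: the value of $re is interpreted as a regex.
if [[ $string =~ $re ]]; then
  unquoted=${BASH_REMATCH[1]}
else
  unquoted='(no match)'
fi

# Quoted: in bash 3.2+ the value is matched as a literal string,
# and the literal text ' cell=([^;]+)' does not occur in $string.
if [[ $string =~ "$re" ]]; then
  quoted=${BASH_REMATCH[1]}
else
  quoted='(no match)'
fi

echo "unquoted: $unquoted"   # prints: unquoted: ABC
echo "quoted:   $quoted"     # prints: quoted:   (no match)
```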
bash 3.0 and 3.1 - or after setting shopt -s compat31:
Quote the regex to make it work:
[[ $string =~ ' cell=([^;]+)' ]] && cell=${BASH_REMATCH[1]} # -> $cell == 'ABC'
ksh
The ksh syntax is the same as in bash, except:
- the name of the special array variable that contains the matched strings is .sh.match (you must enclose the name in {...} even when just implicitly referring to the first element with ${.sh.match}):
[[ $string =~ \ cell=([^;]+) ]] && cell=${.sh.match[1]} # -> $cell == 'ABC'
zsh
The zsh syntax is also similar to bash, except:
- The regex literal must be quoted - for simplicity as a whole, or at least some shell metacharacters, such as ;.
- You may, but needn't, double-quote a regex provided as a variable value.
- Note how this quoting behavior differs fundamentally from that of bash 3.2+: zsh requires quoting only for syntax reasons and always treats the resulting string as a whole as a regex, whether it or parts of it were quoted or not.
- There are 2 variables containing the matching results:
  - $MATCH contains the entire matched string
  - array variable $match contains only the matches for the capture groups (note that zsh arrays start with index 1 and that you don't need to enclose the variable name in {...} to reference array elements)
[[ $string =~ ' cell=([^;]+)' ]] && cell=$match[1] # -> $cell == 'ABC'
Multi-shell implementation of the =~ operator as shell function reMatch
The following shell function abstracts away the differences between bash, ksh, and zsh with respect to the =~ operator; the matches are returned in array variable ${reMatch[@]}.
As @Adrian Frühwirth notes, to write portable (across zsh, ksh, bash) code with this, you need to execute setopt KSH_ARRAYS in zsh so as to make its arrays start with index 0; as a side effect, you also have to use the ${...[]} syntax when referencing arrays, as in ksh and bash.
Applied to our example we'd get:
# zsh: make arrays behave like in ksh/bash: start at *0*
[[ -n $ZSH_VERSION ]] && setopt KSH_ARRAYS
reMatch "$string" ' cell=([^;]+)' && cell=${reMatch[1]}
Shell function:
# SYNOPSIS
# reMatch string regex
# DESCRIPTION
# Multi-shell implementation of the =~ regex-matching operator;
# works in: bash, ksh, zsh
#
# Matches STRING against REGEX and returns exit code 0 if they match.
# Additionally, the matched string(s) is returned in array variable ${reMatch[@]},
# which works the same as bash's ${BASH_REMATCH[@]} variable: the overall
# match is stored in the 1st element of ${reMatch[@]}, with matches for
# capture groups (parenthesized subexpressions), if any, stored in the remaining
# array elements.
# NOTE: zsh arrays by default start with index *1*.
# EXAMPLE:
# reMatch 'This AND that.' '^(.+) AND (.+)\.' # -> ${reMatch[@]} == ('This AND that.', 'This', 'that')
function reMatch {
  typeset ec
  unset -v reMatch # initialize output variable
  [[ $1 =~ $2 ]] # perform the regex test
  ec=$? # save exit code
  if [[ $ec -eq 0 ]]; then # copy result to output variable
    [[ -n $BASH_VERSION ]] && reMatch=( "${BASH_REMATCH[@]}" )
    [[ -n $KSH_VERSION ]] && reMatch=( "${.sh.match[@]}" )
    [[ -n $ZSH_VERSION ]] && reMatch=( "$MATCH" "${match[@]}" )
  fi
  return $ec
}
Note: function reMatch (as opposed to reMatch()) is used to declare the function, which is required for ksh to truly create local variables with typeset.
I would not use cut, since you cannot specify more than one delimiter.
If your grep supports PCRE, then you can do:
$ string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
$ grep -oP '(?<=cell=)[^;]+' <<< "$string"
ABC
You can also use sed; in its simplest form:
$ sed -r 's/.*cell=([^;]+).*/\1/' <<< "$string"
ABC
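One caveat with the substitution above is that sed prints lines that contain no cell= field unchanged. A minimal sketch of a stricter variant follows. It assumes a sed that accepts -E (both GNU and BSD sed do), and the sample line in other is hypothetical. It suppresses default output with -n and prints only where the substitution succeeded:

```shell
string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
other='project=XYZ; strain=C3H'   # hypothetical line without a cell field

# -n: no automatic printing; the trailing p prints only substituted lines.
cell=$(printf '%s\n' "$string" | sed -nE 's/.*cell=([^;]+).*/\1/p')
none=$(printf '%s\n' "$other"  | sed -nE 's/.*cell=([^;]+).*/\1/p')

echo "cell=[$cell] none=[$none]"   # prints: cell=[ABC] none=[]
```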
Another option is to use awk. With that you can do the following by specifying a list of delimiters you want to consider as field separators:
$ awk -F'[;= ]' '{print $5}' <<< "$string"
ABC
You can certainly add more checks by iterating over the fields of the line, so that you don't have to hard-code printing the 5th field.
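Iterating over the fields might look like the sketch below. It assumes the fields are separated by a semicolon followed by optional spaces, and it compares each field's key against cell instead of relying on a position:

```shell
string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'

cell=$(printf '%s\n' "$string" | awk -F'; *' '{
  for (i = 1; i <= NF; i++) {
    eq = index($i, "=")                 # position of the first "=" in this field
    if (eq && substr($i, 1, eq - 1) == "cell")
      print substr($i, eq + 1)          # everything after the "="
  }
}')

echo "$cell"   # prints: ABC
```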
Note that if your shell does not support here-string notation <<< then you can echo the variable and pipe it to the command.
$ echo "$string" | cmd
Here's a native shell solution:
$ string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
$ cell=${string#*cell=}
$ cell=${cell%%;*}
$ echo "${cell}"
ABC
This removes the shortest leading match up to and including cell= from the string, then removes the longest trailing match starting at the first ; leaving you with ABC.
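The same two expansions can be wrapped in a small helper to look up any key. This is only a sketch: get_field is a hypothetical name, and it assumes fields are separated by "; " exactly as in the sample. Anchoring on "; KEY=" avoids accidentally matching a key that merely ends in the requested name.

```shell
# get_field STRING KEY  (hypothetical helper; uses only POSIX parameter expansion)
# Note: _s is a global scratch variable, since POSIX sh has no local variables.
get_field() {
  _s="; $1"                    # prepend a separator so the first field matches too
  case $_s in
    *"; $2="*) ;;              # key present: carry on
    *) return 1 ;;             # key absent: signal failure
  esac
  _s=${_s#*"; $2="}            # drop everything up to and including "; KEY="
  printf '%s\n' "${_s%%;*}"    # drop everything from the next ";" onward
}

string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
get_field "$string" cell   # prints: ABC
get_field "$string" id     # prints: MLN
```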
Here's another solution which uses read to split the strings:
$ cat t.sh
#!/bin/bash
while IFS=$'; \t' read -ra attributes; do
  for foo in "${attributes[@]}"; do
    IFS='=' read -r key value <<< "${foo}"
    [ "${key}" = cell ] && echo "${value}"
  done
done <<EOF
foo=X; cell=ABC; quux=Z;
foo=X; cell=DEF; quux=Z;
EOF
$ ./t.sh
ABC
DEF
For solutions using external tools see @jaypal's excellent answer.