Retrieve a word after a regular expression in shell

Posted 2020-02-18 05:18

Question:

I am trying to retrieve specific fields from a text file that contains metadata lines like the following:

project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN

I have the following script for retrieving the field 'cell':

while read line
do
cell="$(echo $line | cut -d";" -f2 )"
echo  $cell
done < files.txt

However, this script retrieves the whole field, cell=ABC, whereas I just want the value 'ABC'. How do I retrieve only the value after the regex match, in the same line of code?

Answer 1:

If extracting one value (or, generally, a non-repeating set of values captured by distinct capture groups) is enough and you're running bash, ksh, or zsh, consider using the regex-matching operator, =~: [[ string =~ regex ]]:

Tip of the hat to @Adrian Frühwirth for the gist of the ksh and zsh solutions.

Sample input string:

string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'

Shell-specific use of =~ is discussed next; a multi-shell implementation of the =~ functionality via a shell function can be found at the end.


bash

The special BASH_REMATCH array variable receives the results of the matching operation: element 0 contains the entire match, element 1 the first capture group's (parenthesized subexpression's) match, and so on.

bash 3.2+:

[[ $string =~ \ cell=([^;]+) ]] && cell=${BASH_REMATCH[1]} # -> $cell == 'ABC'

bash 4.x:
While the specific command above works, using regex literals in bash 4.x is buggy, notably when involving word-boundary assertions \< and \> on Linux; e.g., [[ a =~ \<a ]] inexplicably doesn't match. Workaround: use an intermediate variable (unquoted!): re='\<a'; [[ a =~ $re ]] works (also on bash 3.2+).

bash 3.0 and 3.1 - or after setting shopt -s compat31:
Quote the regex to make it work:

[[ $string =~ ' cell=([^;]+)' ]] && cell=${BASH_REMATCH[1]}  # -> $cell == 'ABC'
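The intermediate-variable form mentioned above can also be used for the sample string, and it shows how the elements of BASH_REMATCH line up with the capture groups. This is only a sketch: it assumes bash 3.2 or later, and the names re, whole and strain are made up for illustration.

```shell
# Requires bash (uses [[ ... =~ ... ]] and BASH_REMATCH).
string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'

# Keeping the regex in a variable means the space and ';' need no escaping.
re='cell=([^;]+); strain=([^;]+)'

# Reference the variable unquoted so that it is treated as a regex.
if [[ $string =~ $re ]]; then
  whole=${BASH_REMATCH[0]}    # entire match
  cell=${BASH_REMATCH[1]}     # first capture group
  strain=${BASH_REMATCH[2]}   # second capture group
fi

printf '%s\n' "$whole" "$cell" "$strain"
```

With the sample string this should print cell=ABC; strain=C3H, then ABC, then C3H, each on its own line.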

ksh

The ksh syntax is the same as in bash, except:

  • the name of the special array variable that contains the matched strings is .sh.match (you must enclose the name in {...} even when just implicitly referring to the first element with ${.sh.match}):
[[ $string =~ \ cell=([^;]+) ]] && cell=${.sh.match[1]} # -> $cell == 'ABC'

zsh

The zsh syntax is also similar to bash, except:

  • The regex literal must be quoted: either as a whole (the simplest option), or at least its shell metacharacters, such as ;.
    • You may, but needn't, double-quote a regex provided as a variable value.
    • Note how this quoting behavior differs fundamentally from that of bash 3.2+: zsh requires quoting only for syntax reasons and always treats the resulting string as a whole as a regex, whether it or parts of it were quoted or not.
  • There are 2 variables containing the matching results:
    • $MATCH contains the entire matched string
    • array variable $match contains only the matches for the capture groups (note that zsh arrays start with index 1 and that you don't need to enclose the variable name in {...} to reference array elements)
 [[ $string =~ ' cell=([^;]+)' ]] && cell=$match[1] # -> $cell == 'ABC'

Multi-shell implementation of the =~ operator as shell function reMatch

The following shell function abstracts away the differences between bash, ksh, zsh with respect to the =~ operator; the matches are returned in array variable ${reMatches[@]}.

As @Adrian Frühwirth notes, to write portable code (across zsh, ksh, bash) with this, you need to execute setopt KSH_ARRAYS in zsh so that its arrays start with index 0; as a side effect, you also have to use the ${...[]} syntax when referencing arrays, as in ksh and bash.

Applied to our example we'd get:

# zsh: make arrays behave like in ksh/bash: start at *0*
[[ -n $ZSH_VERSION ]] && setopt KSH_ARRAYS

reMatch "$string" ' cell=([^;]+)' && cell=${reMatches[1]}

Shell function:

# SYNOPSIS
#   reMatch string regex
# DESCRIPTION
#   Multi-shell implementation of the =~ regex-matching operator;
#   works in: bash, ksh, zsh
#
#   Matches STRING against REGEX and returns exit code 0 if they match.
#   Additionally, the matched string(s) is returned in array variable ${reMatches[@]},
#   which works the same as bash's ${BASH_REMATCH[@]} variable: the overall
#   match is stored in the 1st element of ${reMatches[@]}, with matches for
#   capture groups (parenthesized subexpressions), if any, stored in the remaining
#   array elements.
#   NOTE: zsh arrays by default start with index *1*.
# EXAMPLE:
#   reMatch 'This AND that.' '^(.+) AND (.+)\.' # -> ${reMatches[@]} == ('This AND that.', 'This', 'that')
function reMatch {
  typeset ec
  unset -v reMatches # initialize output variable
  [[ $1 =~ $2 ]] # perform the regex test
  ec=$? # save exit code
  if [[ $ec -eq 0 ]]; then # copy result to output variable
    [[ -n $BASH_VERSION ]] && reMatches=( "${BASH_REMATCH[@]}" )
    [[ -n $KSH_VERSION ]]  && reMatches=( "${.sh.match[@]}" )
    [[ -n $ZSH_VERSION ]]  && reMatches=( "$MATCH" "${match[@]}" )
  fi
  return $ec
}

Note:

  • function reMatch (as opposed to reMatch()) is used to declare the function, which is required for ksh to truly create local variables with typeset.


Answer 2:

I would not use cut, since you cannot specify more than one delimiter.

If your grep supports PCRE, then you can do:

$ string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
$ grep -oP '(?<=cell=)[^;]+' <<< "$string"
ABC

You can also use sed. A simple version (with GNU sed; on BSD/macOS use -E instead of -r) is:

$ sed -r 's/.*cell=([^;]+).*/\1/' <<< "$string"
ABC
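One caveat with the sed command above: when a line does not contain cell=, the substitution does nothing and sed prints the line unchanged. A possible variant, sketched here with POSIX basic regex syntax and without a here-string, prints only lines where the substitution succeeded. The variable name none is just for illustration.

```shell
string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'

# -n suppresses the default output; the trailing p prints a line only
# if the substitution on it succeeded.
cell=$(printf '%s\n' "$string" | sed -n 's/.*cell=\([^;]*\).*/\1/p')
echo "$cell"

# A line without the key yields no output instead of being echoed back.
none=$(printf '%s\n' 'project=XYZ; strain=C3H' | sed -n 's/.*cell=\([^;]*\).*/\1/p')
```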

Another option is to use awk. With it you can specify the list of delimiters that you want to treat as field separators:

$ awk -F'[;= ]' '{print $5}' <<< "$string"
ABC

You can add more checks, for example by iterating over the fields of each line, so that you don't have to hard-code printing the 5th field.
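A minimal sketch of that idea, assuming fields are separated by a semicolon followed by optional spaces and that values contain no = sign. The awk variable key and the array kv are names chosen here for illustration.

```shell
string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'

# Split each line on ';' plus optional spaces, then split every field on '='
# and print the value whose key matches.
cell=$(printf '%s\n' "$string" | awk -F'; *' -v key=cell '
  {
    for (i = 1; i <= NF; i++) {
      n = split($i, kv, "=")
      if (n >= 2 && kv[1] == key) { print kv[2]; next }  # first match per line
    }
  }')
echo "$cell"
```

Because the key is passed in with -v, the same command can look up other fields, such as strain or id, without counting field positions.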

Note that if your shell does not support here-string notation <<< then you can echo the variable and pipe it to the command.

$ echo "$string" | cmd


Answer 3:

Here's a native shell solution:

$ string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
$ cell=${string#*cell=}
$ cell=${cell%%;*}
$ echo "${cell}"
ABC

This removes the shortest leading match up to and including cell= from the string, then removes the longest trailing match starting at the first ; that follows, leaving you with ABC.
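The two expansions can be wrapped in a small function so that the key is not hard-coded. This is a sketch only: get_field is a hypothetical helper name, and it assumes that fields are separated by exactly "; " (semicolon plus one space) as in the sample string.

```shell
# get_field KEY STRING
# Prints the value of KEY; returns 1 if KEY is not a field name in STRING.
# (get_field and _rest are illustrative names, not from the original answer.)
get_field() {
  _rest="; $2"                     # prefix so that every field starts with "; "
  case $_rest in
    *"; $1="*) ;;                  # key found at the start of a field
    *) return 1 ;;
  esac
  _rest=${_rest#*"; $1="}          # drop everything up to and including "; KEY="
  printf '%s\n' "${_rest%%;*}"     # drop everything from the next ';' on
}

string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
cell=$(get_field cell "$string")
echo "$cell"
```

Matching on "; KEY=" rather than just "KEY=" avoids picking up a longer key that merely ends in the same letters.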

Here's another solution which uses read to split the strings:

$ cat t.sh
#!/bin/bash

while IFS=$'; \t' read -ra attributes; do
    for foo in "${attributes[@]}"; do
        IFS='=' read -r key value <<< "${foo}"
        [ "${key}" = cell ] && echo "${value}"
    done
done <<EOF
foo=X;  cell=ABC;  quux=Z;
foo=X;  cell=DEF;  quux=Z;
EOF


$ ./t.sh
ABC
DEF
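For a file with one record per line, as in the question, the parameter-expansion approach can go straight into the read loop. The sketch below creates a temporary file as a stand-in for files.txt; its contents are made-up sample data.

```shell
# Demo input standing in for files.txt from the question (sample data).
input=$(mktemp)
printf '%s\n' \
  'project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN' \
  'project=XYZ; cell=DEF; strain=C3H; sex=M; age=PQR; treatment=None; id=OPQ' > "$input"

while IFS= read -r line; do
  case $line in
    *cell=*) ;;          # only handle lines that contain the key
    *) continue ;;
  esac
  cell=${line#*cell=}    # drop everything up to and including "cell="
  cell=${cell%%;*}       # drop everything from the next ';' on
  echo "$cell"
done < "$input"

rm -f "$input"
```

No external command runs per line here, which can matter for large files.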

For solutions using external tools see @jaypal's excellent answer.



Tags: regex shell