I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.
But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:
Example regular expression:
.*abc([0-9]+)xyz.*
Example input file:
a
b
c
abc12345xyz
a
b
c
As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to have, from within my bash script, is:
myvalue=$( sed <...something...> input.txt )
Things I've tried include:
sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing
My sed (Mac OS X) didn't work with +. I tried * instead and added the p flag to print the match:
For matching at least one numeric character without +, I would use:
I use perl to make this easier for myself, e.g.:
This runs Perl. The -n option instructs Perl to read one line at a time from STDIN (or from any files named on the command line) and execute the code for each line. The -e option specifies the instruction to run.
The instruction runs a regexp on the line read and, if it matches, prints the contents of the first set of brackets ($1).
You can also do this with multiple file names on the end, e.g.:
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt
If your version of grep supports it, you could use the -o option to print only the portion of any line that matches your regexp.
If not, then here's the best sed I could come up with:
... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one.)
The problem with something like:
.... or
... is that sed only supports "greedy" matching, so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match, or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from within the pattern space (a line).
perl is the cleanest syntax, but if you don't have perl (not always there, I understand), then one way to use gawk with the components of a regex is the gensub feature.
output of the sample input file will be
Note: gensub replaces the entire text matched by the regex (the part between the //), so you need to put the .* before and after the ([0-9]+) to get rid of the text before and after the number in the substitution.
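A rough sketch of the gensub approach described above, assuming gawk is installed. gensub() is a gawk extension, so the sketch falls back to two plain sub() calls on other awks. The pattern and the variable name value are my own choices for the question's sample, not necessarily what the answer used.

```shell
#!/usr/bin/env bash
# Recreate the sample input from the question in a temporary file.
input=$(mktemp)
printf '%s\n' a b c abc12345xyz a b c > "$input"

if command -v gawk >/dev/null 2>&1; then
  # gensub() returns the modified string instead of editing $0 in place.
  # "\\1" in the awk string refers to the first capture group; the third
  # argument (1) means "replace the first match only".
  value=$(gawk '/abc[0-9]+xyz/ { print gensub(/.*abc([0-9]+)xyz.*/, "\\1", 1) }' "$input")
else
  # Fallback for awks without gensub(): strip the text around the digits.
  value=$(awk '/abc[0-9]+xyz/ { sub(/.*abc/, ""); sub(/xyz.*/, ""); print }' "$input")
fi

echo "$value"
rm -f "$input"
```

Anchoring the group between abc and xyz keeps the greedy .* from swallowing most of the digits, which is what would happen with a bare .*([0-9]+).* pattern.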
You can use sed to do this
-n: don't print the resulting line
-r: this makes it so you don't have to escape the capture group parens ()
\1: the capture group match
/g: global match
/p: print the result
I wrote a tool for myself that makes this easier
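As a sketch of the plain sed approach with the options listed above (this is not the author's tool), assuming a sed that accepts -E for extended regexes. GNU sed and the BSD/macOS sed both do, and GNU sed also accepts -r as a synonym. The variable name myvalue comes from the question; the pattern is my assumption.

```shell
#!/usr/bin/env bash
# Recreate the sample input from the question in a temporary file.
input=$(mktemp)
printf '%s\n' a b c abc12345xyz a b c > "$input"

# -n : no automatic printing of each line
# -E : extended regex, so ( ) and + need no backslashes
# \1 : the text captured by the first group
# p  : print only the lines where a substitution was made
myvalue=$(sed -nE 's/.*abc([0-9]+)xyz.*/\1/p' "$input")

echo "$myvalue"
rm -f "$input"
```

I left out the g flag here: the pattern is wrapped in .* on both sides, so it can match at most once per line and g would change nothing.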
You can use awk with match() to access the captured group (the form of match() that fills an array is a gawk extension):
This tries to match the pattern abc[0-9]+xyz. If it does so, it stores its slices in the array matches, whose first item is the block [0-9]+. Since match() returns the character position, or index, of where that substring begins (1, if it starts at the beginning of the string), it triggers the print action.
With grep you can use a look-behind and look-ahead (this needs a grep with Perl-compatible regex support, such as GNU grep's -P option):
This checks the pattern [0-9]+ when it occurs within abc and xyz and just prints the digits.
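Both of the last two answers can be sketched against the sample input as follows. These are my reconstructions under stated assumptions, not the original commands: the array form of match() needs gawk, and the look-around pattern needs grep -P, so each has a fallback that uses only POSIX features. The variable names are illustrative.

```shell
#!/usr/bin/env bash
# Recreate the sample input from the question in a temporary file.
input=$(mktemp)
printf '%s\n' a b c abc12345xyz a b c > "$input"

# awk: with gawk, match() can fill an array; m[1] is the first group.
# POSIX awk only sets RSTART and RLENGTH, so the fallback slices the
# match with substr(), skipping the 3-character "abc" and "xyz" parts.
if command -v gawk >/dev/null 2>&1; then
  awk_value=$(gawk 'match($0, /abc([0-9]+)xyz/, m) { print m[1] }' "$input")
else
  awk_value=$(awk 'match($0, /abc[0-9]+xyz/) { print substr($0, RSTART + 3, RLENGTH - 6) }' "$input")
fi

# grep: probe for -P support first. Without it, chain two plain -o
# filters: first the delimited token, then the digits inside it.
if printf 'abc1xyz\n' | grep -qP '(?<=abc)[0-9]+(?=xyz)' 2>/dev/null; then
  grep_value=$(grep -oP '(?<=abc)[0-9]+(?=xyz)' "$input")
else
  grep_value=$(grep -o 'abc[0-9][0-9]*xyz' "$input" | grep -o '[0-9][0-9]*')
fi

echo "awk:  $awk_value"
echo "grep: $grep_value"
rm -f "$input"
```

The two-step grep fallback only works cleanly here because the delimiters contain no digits; if they did, the second filter would pick those up too.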