I have a file, which holds phone number data, and also some useless stuff.
I'm trying to parse the numbers out, and when there is only one phone number per line, it's not a problem.
But when a line has multiple numbers, sed matches the last one (even though everywhere I read says it should match only the first occurrence of the pattern?), and I can't get the other numbers out.
My data.txt:
bla bla bla NUM:09011111111 bla bla bla bla NUM:08022222222 bla bla bla
When I parse for the data, my idea was first to remove all the "initial" "bla bla bla" in front of the first phone number (so I search for first occurrence of 'NUM:'), then I remove all the stuff after phone number, and get the number.
After that I want to parse the next occurrence from the leftover string.
So now when I try to sed it, I always get the last number on the line:
>sed 's/.*NUM://' data.txt
08022222222 bla bla bla
>
Primarily I would like to understand what's wrong with my understanding of SED. Of course more efficient suggestions are welcome!
Doesn't my sed command say: replace everything before 'NUM:' with '' (empty)? Why does it always match the last occurrence?
Thanks!
This might work for you:
echo "bla bla bla NUM:09011111111 bla bla bla bla NUM:08022222222 bla bla bla" |
sed 's/NUM:/\n&/g;s/[^\n]*\n\(NUM:[0-9]*\)[^\n]*/\1 /g;s/.$//'
NUM:09011111111 NUM:08022222222
The problem you have is understanding that the .* is greedy, i.e. it matches the longest match, not the first match. By placing a unique character (\n; sed uses it as a line delimiter, so it cannot exist in the line) in front of the string we're interested in (NUM:...) and deleting everything that is not that unique character ([^\n]*) followed by the unique character \n, we effectively split the string into manageable pieces.
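To make the marker idea easier to see, here is a rough sketch that stops after the first substitution and then hands the pieces to a second sed. It assumes GNU sed (which understands \n in the replacement); the variable names and the two-stage pipeline are only for illustration and are not the command given above.

```shell
# Sample line from the question.
line='bla bla bla NUM:09011111111 bla bla bla bla NUM:08022222222 bla bla bla'

# Step 1: put a newline in front of every NUM: so each number starts its own piece.
# (Assumes GNU sed, where \n in the replacement inserts a newline.)
marked=$(printf '%s\n' "$line" | sed 's/NUM:/\n&/g')
printf '%s\n' "$marked"

# Step 2 (illustrative variation): each piece is now a separate line,
# so an anchored pattern is enough and greediness no longer matters.
numbers=$(printf '%s\n' "$marked" | sed -n 's/^NUM:\([0-9]*\).*/\1/p')
printf '%s\n' "$numbers"
```

After step 1 every piece that holds a number begins with NUM:, so the second sed can anchor on ^NUM: instead of relying on .* to stop in the right place.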
As you know by now, sed regexes are greedy and as far as I can tell can't be made non-greedy.
Two alternatives that haven't been brought up until now are to just use other tools for this kind of matching/extraction.
You can use perl as a drop-in replacement for sed with the -pe parameters. It supports the ? non-greedy modifier:
$ perl -pe 's/.*?NUM://' data.txt
09011111111 bla bla bla bla NUM:08022222222 bla bla bla
You can use the -o option to GNU grep to get only the bits of your data that match the regex:
$ egrep -o 'NUM:[0-9]*' data.txt
NUM:09011111111
NUM:08022222222
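If you want the bare digits without the NUM: prefix, one possible follow-up is to pipe the matches through cut. This is a small sketch that assumes the numbers themselves never contain a colon.

```shell
# Sample line from the question.
line='bla bla bla NUM:09011111111 bla bla bla bla NUM:08022222222 bla bla bla'

# grep -o prints each match on its own line; cut keeps the field after the colon.
digits=$(printf '%s\n' "$line" | grep -o 'NUM:[0-9]*' | cut -d: -f2)
printf '%s\n' "$digits"
```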
If a number is defined as the digits following NUM:, then:
sed -n -e 's/$/\n/' -e ':begin' \
-e 's/\(NUM:[0-9][0-9]*\)\(.*\)\n\(.*\)/\2\n\3 \1/' \
-e 'tbegin' -e 's/.*\n //' -e '/NUM/p'
What this does is:
1. Put a \n at the end of the line to act as a marker.
2. Try to find a number before the marker, and move it to the end of the line (after the marker).
3. If a number was found, go back to step 2.
4. When no numbers are left before the marker, remove everything before the numbers.
5. If a number is on the line, print it (to handle the case where no numbers are found).
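The steps above can be sketched as a commented script with a shortened sample input, which may make the loop easier to follow. It assumes GNU sed (for \n in the replacement and for comments inside the script); the sample lines and the result variable are made up for the example.

```shell
# Two sample lines: one with two numbers, one with none.
result=$(printf '%s\n' 'bla NUM:090 bla NUM:080 bla' 'no numbers here' |
sed -n '
  # Step 1: append a newline marker to the line.
  s/$/\n/
  :begin
  # Step 2: move the leftmost number before the marker to the end of the line.
  s/\(NUM:[0-9][0-9]*\)\(.*\)\n\(.*\)/\2\n\3 \1/
  # Step 3: if a substitution happened, loop again.
  t begin
  # Step 4: drop everything up to the marker and the space after it.
  s/.*\n //
  # Step 5: print only lines that ended up holding a number.
  /NUM/p
')
printf '%s\n' "$result"
```

Because the leftmost number is moved first and appended at the end each time, the numbers come out in their original order, and the line without any number is not printed.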
It can also be done the other way around, first dropping lines without numbers:
sed -e '/NUM/!d' -e 's/$/\n/' -e ':begin' \
-e 's/\(NUM:[0-9][0-9]*\)\(.*\)\n\(.*\)/\2\n\3 \1/' \
-e 'tbegin' -e 's/.*\n //'
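If a line contains exactly two numbers, you can use this pattern, which forces the greedy .* to stop at the first NUM: by requiring a second NUM: later in the line: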
sed -r 's/^(.*NUM:)(.*NUM:.*)$/\2/'