Replace strings with an evaluated string based on the matched string

Posted 2019-05-23 08:32

Question:

I'm looking for a way to replace strings in a file, matched by a regular expression, with another string that is generated/evaluated from the matched string.

For example, I want to replace the timestamps (timestamp + duration) in this file

1357222500 3600 ...
Maybe intermediate strings...
1357226100 3600 ...
Maybe intermediate strings...
...

with human-readable date representations (date ranges).

Until now, I have always used shell scripts (Bash) to iterate over each line, match the line against a pattern, extract the matched groups, and print the processed line, for example like this (from memory):

IFS="
"
for L in `cat file.txt`; do
  if [[ "${L}" =~ ^([0-9]{1,10})\ ([0-9]{1,4})\ .*$ ]]; then
    # Written as three lines for better readability/recognition
    echo -n "`date --date=@${BASH_REMATCH[1]}` - "
    echo -n "`date --date=@$(( ${BASH_REMATCH[1]} + ${BASH_REMATCH[2]} ))`"
    echo ""
  else
    echo "$L"
  fi
done
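
For comparison, a more robust way to write the same loop is a while read loop, which avoids the word-splitting and globbing pitfalls of iterating over the output of cat. This is only a minimal sketch, assuming the same file.txt layout and GNU date:

#!/usr/bin/env bash
# Read file.txt line by line; rewrite "timestamp duration ..." lines as a date range.
re='^([0-9]{1,10}) ([0-9]{1,4}) .*$'
while IFS= read -r L; do
  if [[ $L =~ $re ]]; then
    start=${BASH_REMATCH[1]}
    end=$(( BASH_REMATCH[1] + BASH_REMATCH[2] ))
    echo "$(date --date=@"$start") - $(date --date=@"$end")"
  else
    echo "$L"
  fi
done < file.txt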

I wonder if there's something like this with a fictional(?) "sed-2.0":

cat file.txt | sed-2.0 's+/^\([0-9]\{1,10\}\) \([0-9]\{1,4\}\) .*$+`date --date="@\1"` - `date --date="@$(( \1 + \2 ))`'

Here, the backticks in the sed-2.0 replacement would be evaluated as shell commands, with the matched groups \1 and \2 passed to them.

I know that this does not work as expected, but I'd like to write something like this.

Edit 1

Edit of the question above: added the missing echo "" in the if branch of the Bash script example.

This should be the expected output:

Do 3. Jan 15:15:00 CET 2013 - Do 3. Jan 16:15:00 CET 2013
Maybe intermediate strings...
Do 3. Jan 16:15:00 CET 2013 - Do 3. Jan 17:15:00 CET 2013
Maybe intermediate strings...
...

Note that the formatted timestamps depend on the time zone.

Edit 2

Edit of the question above: fixed a syntax error in the Bash script example; added a comment.

Edit 3

Edit of the question above: fixed a syntax error in the Bash script example. Changed the phrase "old-school example" to "Bash script example".


Summary of Kent's and glenn jackman's answers

There's a huge difference between the approaches: execution time. I've compared all four methods; here are the results:

gawk using strftime()

/usr/bin/time gawk '/^[0-9]+ [0-9]+ / {t1=$1; $1=strftime("%c -",t1); $2=strftime("%c",t1+$2)} 1' /tmp/test
...
0.06user 0.12system 0:00.30elapsed 60%CPU (0avgtext+0avgdata 1148maxresident)k
0inputs+0outputs (0major+327minor)pagefaults 0swaps

gawk using command execution through getline (GNU Awk Manual)

/usr/bin/time gawk '/^[0-9]{1,10} [0-9]{1,4}/{l=$1+$2; "date --date=@"$1|getline d1; "date --date=@"l|getline d2;print d1" - "d2;next;}1' /tmp/test
...
1.89user 7.59system 0:10.34elapsed 91%CPU (0avgtext+0avgdata 5376maxresident)k
0inputs+0outputs (0major+557419minor)pagefaults 0swaps

Custom Bash script

./sed-2.0.sh /tmp/test
...
3.98user 10.33system 0:15.41elapsed 92%CPU (0avgtext+0avgdata 1536maxresident)k
0inputs+0outputs (0major+759829minor)pagefaults 0swaps

sed using the e option

/usr/bin/time sed -r 's#^([0-9]{1,10}) ([0-9]{1,4})(.*$)#echo $(date --date=@\1 )" - "$(date --date=@$((\1+\2)))#ge' /tmp/test
...
3.88user 16.76system 0:21.89elapsed 94%CPU (0avgtext+0avgdata 1272maxresident)k
0inputs+0outputs (0major+1253409minor)pagefaults 0swaps

Input data

for N in `seq 1 1000`; do echo -e "$(( 1357226100 + ( $N * 3600 ) )) 3600 ...\nSomething else ..." >> /tmp/test ; done

We can see that AWK using the strftime() method is the fastest, because it formats the dates in-process; the getline, Bash, and sed variants fork a date process for every matching line. But even the Bash script is faster than sed with shell execution.

Kent showed us a more generic, universal way to accomplish what I asked for. My question was actually not limited to the timestamp example: in this case I had to replace timestamp + duration with a human-readable date representation, but there have been situations where I had to execute other code.

glenn jackman showed us a specific solution which is suitable for situations where you can do the string operations and calculations directly in AWK.

So which method should be preferred depends on the time you have (or the time your script may run), the amount of data, and the use case.

Answer 1:

awk one-liner (the datetime format could differ from your expected output):

awk '/^[0-9]{1,10} [0-9]{1,4}/{l=$1+$2; "date --date=@"$1|getline d1; "date --date=@"l|getline d2;print d1" - "d2;next;}1' file
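
For readability, here is the same approach written out as a commented awk program. As an addition that is not in the one-liner above, this sketch also close()s each command pipe after getline, so long inputs do not accumulate open file descriptors:

# Rewrite "timestamp duration ..." lines as a human-readable date range.
/^[0-9]{1,10} [0-9]{1,4}/ {
    cmd1 = "date --date=@" $1          # start time
    cmd2 = "date --date=@" ($1 + $2)   # end time = start + duration
    cmd1 | getline d1; close(cmd1)
    cmd2 | getline d2; close(cmd2)
    print d1 " - " d2
    next
}
1   # print all other lines unchanged

Run it as awk -f range.awk file (range.awk is just a hypothetical file name).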

test:

kent$  echo "1357222500 3600 ...
Maybe intermediate strings...
1357226100 3600 ...
Maybe intermediate strings...
..."|awk '/^[0-9]{1,10} [0-9]{1,4}/{l=$1+$2; "date --date=@"$1|getline d1; "date --date=@"l|getline d2;print d1" - "d2;next;}1'    
Thu Jan  3 15:15:00 CET 2013 - Thu Jan  3 16:15:00 CET 2013
Maybe intermediate strings...
Thu Jan  3 15:15:00 CET 2013 - Thu Jan  3 17:15:00 CET 2013
Maybe intermediate strings...
...

GNU sed

if you have GNU sed, the idea from your "not working" sed line can work in the real world by applying GNU sed's s/foo/shell cmds/ge: with the e flag, the substituted pattern space is executed as a shell command and replaced by that command's output. See below:

sed -r 's#^([0-9]{1,10}) ([0-9]{1,4})(.*$)#echo $(date --date=@\1 )" - "$(date --date=@$((\1+\2)))#ge'  file

test

kent$  echo "1357222500 3600 ...
Maybe intermediate strings...
1357226100 3600 ...
Maybe intermediate strings...
..."|sed -r 's#^([0-9]{1,10}) ([0-9]{1,4})(.*$)#echo $(date --date=@\1 )" - "$(date --date=@$((\1+\2)))#ge'                                                                 
Thu Jan 3 15:15:00 CET 2013 - Thu Jan 3 16:15:00 CET 2013
Maybe intermediate strings...
Thu Jan 3 16:15:00 CET 2013 - Thu Jan 3 17:15:00 CET 2013
Maybe intermediate strings...
...

if I were to work on this, personally I would go with awk, because it is straightforward and easy to write.

at the end I paste my sed/awk version info:

kent$  sed --version|head -1
sed (GNU sed) 4.2.2

kent$  awk -V|head -1
GNU Awk 4.0.1


Answer 2:

based on your sample input:

gawk '/^[0-9]+ [0-9]+ / {t1=$1; $1=strftime("%c -",t1); $2=strftime("%c",t1+$2)} 1'

outputs

Thu 03 Jan 2013 09:15:00 AM EST - Thu 03 Jan 2013 10:15:00 AM EST ...
Maybe intermediate strings...
Thu 03 Jan 2013 10:15:00 AM EST - Thu 03 Jan 2013 11:15:00 AM EST ...
Maybe intermediate strings...
...
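
For reference, the same one-liner written out as a commented gawk program (equivalent behavior; strftime() is gawk-specific):

# gawk: rewrite "timestamp duration ..." lines in place using strftime().
/^[0-9]+ [0-9]+ / {
    t1 = $1                       # remember the start timestamp
    $1 = strftime("%c -", t1)     # field 1 becomes the formatted start time plus " -"
    $2 = strftime("%c", t1 + $2)  # field 2 becomes the formatted end time
}
1                                 # print every line, modified or not

Run as gawk -f format.awk file.txt (format.awk is a hypothetical file name); the trailing "..." on each matching line is kept because only fields 1 and 2 are replaced.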