I'm looking for a way to replace strings in a file that match a regular expression with another string that is generated/evaluated from the matched string.
For example, I want to replace the timestamps (timestamp + duration) in this file
1357222500 3600 ...
Maybe intermediate strings...
1357226100 3600 ...
Maybe intermediate strings...
...
with human-readable date representations (a date range).
Until now, I have always used shell scripts (e.g. Bash) to iterate over the lines, match each line against the pattern, extract the matched groups, and print the processed line, for example like this (from memory):
# Read line by line; unlike a for loop over `cat`, this keeps empty lines and does not expand globs
while IFS= read -r L; do
    if [[ "${L}" =~ ^([0-9]{1,10})\ ([0-9]{1,4})\ .*$ ]]; then
        # Written as three lines for better readability/recognition
        echo -n "$(date --date=@"${BASH_REMATCH[1]}") - "
        echo -n "$(date --date=@$(( ${BASH_REMATCH[1]} + ${BASH_REMATCH[2]} )))"
        echo ""
    else
        echo "$L"
    fi
done < file.txt
I wonder if there's something like this with a fictional(?) "sed-2.0":
cat file.txt | sed-2.0 's+^\([0-9]\{1,10\}\) \([0-9]\{1,4\}\) .*$+`date --date="@\1"` - `date --date="@$(( \1 + \2 ))"`+'
Here, the backticks in the sed-2.0 replacement would be evaluated as shell commands, with the matched groups \1 and \2 passed in.
I know that this does not work as written, but I'd like to write something like this.
Edit 1
Edit of question above: added missing echo "" in the if branch of the Bash script example.
This should be the expected output:
Do 3. Jan 15:15:00 CET 2013 - Do 3. Jan 16:15:00 CET 2013
Maybe intermediate strings...
Do 3. Jan 16:15:00 CET 2013 - Do 3. Jan 17:15:00 CET 2013
Maybe intermediate strings...
...
Note that the output depends on the time zone.
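To illustrate the time zone dependence, here is a small sketch using GNU date (the same tool as in the script above). The POSIX-style TZ values UTC0 and CET-1 are chosen only for the demonstration; the expected output above corresponds to CET with a German locale.

```shell
# Same instant rendered in two time zones (GNU date assumed).
ts=1357222500
TZ=UTC0  date --date="@$ts" '+%Y-%m-%d %H:%M:%S %Z'   # 2013-01-03 14:15:00 UTC
TZ=CET-1 date --date="@$ts" '+%Y-%m-%d %H:%M:%S %Z'   # 2013-01-03 15:15:00 CET
```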
Edit 2
Edit of question above: fixed syntax error of Bash script example, added comment.
Edit 3
Edit of question above: fixed syntax error of Bash script example. Changed the phrase "old-school example" to "Bash script example".
Summary of Kent's and glenn jackman's answers
There's a huge difference between the approaches: the execution time. I've compared all four methods; here are the results:
gawk using strftime()
/usr/bin/time gawk '/^[0-9]+ [0-9]+ / {t1=$1; $1=strftime("%c -",t1); $2=strftime("%c",t1+$2)} 1' /tmp/test
...
0.06user 0.12system 0:00.30elapsed 60%CPU (0avgtext+0avgdata 1148maxresident)k
0inputs+0outputs (0major+327minor)pagefaults 0swaps
gawk using command execution through getline (GNU AWK manual)
/usr/bin/time gawk '/^[0-9]{1,10} [0-9]{1,4}/{l=$1+$2; "date --date=@"$1|getline d1; "date --date=@"l|getline d2;print d1" - "d2;next;}1' /tmp/test
...
1.89user 7.59system 0:10.34elapsed 91%CPU (0avgtext+0avgdata 5376maxresident)k
0inputs+0outputs (0major+557419minor)pagefaults 0swaps
Custom Bash script
./sed-2.0.sh /tmp/test
...
3.98user 10.33system 0:15.41elapsed 92%CPU (0avgtext+0avgdata 1536maxresident)k
0inputs+0outputs (0major+759829minor)pagefaults 0swaps
sed using the e flag
/usr/bin/time sed -r 's#^([0-9]{1,10}) ([0-9]{1,4})(.*$)#echo $(date --date=@\1 )" - "$(date --date=@$((\1+\2)))#ge' /tmp/test
...
3.88user 16.76system 0:21.89elapsed 94%CPU (0avgtext+0avgdata 1272maxresident)k
0inputs+0outputs (0major+1253409minor)pagefaults 0swaps
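For readers unfamiliar with the e flag: in GNU sed, when a substitution with this flag succeeds, the resulting pattern space is executed as a shell command and replaced by that command's output. A minimal sketch with made-up input (not the benchmark data):

```shell
# GNU sed only: the matching line is rewritten to "echo $((3 + 4))",
# run through the shell, and replaced by the command's output.
# Non-matching lines pass through unchanged.
printf '3 4\nplain text\n' |
    sed -r 's/^([0-9]+) ([0-9]+)$/echo $((\1 + \2))/e'
# prints:
# 7
# plain text
```

Every matching line therefore costs at least one shell invocation, plus whatever processes the command itself starts.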
Input data
for N in `seq 1 1000`; do echo -e "$(( 1357226100 + ( $N * 3600 ) )) 3600 ...\nSomething else ..." >> /tmp/test ; done
We can see that AWK using the strftime() method is by far the fastest. But even the Bash script is faster than sed with shell execution.
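A likely reason for the gap is that the three slower variants start external date processes for every timestamp line, while strftime() runs inside gawk. As a sketch of the same idea in pure Bash, the printf builtin can format timestamps itself via %(FORMAT)T. This assumes Bash 4.2 or newer; the function name convert_timestamps and its optional format argument are made up for this example, and it was not part of the measurements above.

```shell
#!/usr/bin/env bash
# Sketch: convert "timestamp duration ..." lines without forking date.
# Assumes Bash >= 4.2 (printf '%(FORMAT)T'). convert_timestamps is a hypothetical name.
convert_timestamps() {
    local fmt=${1:-%c}                         # strftime format, defaults to the locale's %c
    local re='^([0-9]{1,10}) ([0-9]{1,4}) '    # same pattern as in the question
    local line start end
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line =~ $re ]]; then
            # 10# forces base 10 so leading zeros are not read as octal
            start=$(( 10#${BASH_REMATCH[1]} ))
            end=$(( start + 10#${BASH_REMATCH[2]} ))
            printf "%(${fmt})T - %(${fmt})T\n" "$start" "$end"
        else
            printf '%s\n' "$line"
        fi
    done
}
```

Usage would look like `convert_timestamps < /tmp/test`, or `convert_timestamps '%F %T' < /tmp/test` for a locale-independent format.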
Kent showed us a more generic way to accomplish what I asked for. My question was not limited to the timestamp example: in this case I needed exactly this (replacing timestamp + duration with a human-readable date representation), but I have had situations where I needed to execute other code.
glenn jackman showed us a specific solution that is suitable for situations where you can do the string operations and calculations directly in AWK.
So which method is preferable depends on how long the script may run, the amount of data, and the use case.