I have a file containing a list of replacement pairs (about 100 of them) which are used by sed
to replace strings in files.
The pairs go like:
old|new
tobereplaced|replacement
(stuffiwant).*(too)|\1\2
and my current code is:
cat replacement_list | while read i
do
old=$(echo "$i" | awk -F'|' '{print $1}') #due to the need for extended regex
new=$(echo "$i" | awk -F'|' '{print $2}')
sed -r "s/`echo "$old"`/`echo "$new"`/g" -i file
done
I cannot help but think that there is a more optimal way of performing the replacements. I tried turning the loop around to run through lines of the file first but that turned out to be much more expensive.
Are there any other ways of speeding up this script?
EDIT
Thanks for all the quick responses. Let me try out the various suggestions before choosing an answer.
One thing to clear up: I also need subexpressions/groups functionality. For example, one replacement I might need is:
([0-9])U|\10 #the extra brackets and escapes were required for my original code
Some details on the improvements (to be updated):
- Method: processing time
- Original script: 0.85s
- cut instead of awk: 0.71s
- anubhava's method: 0.18s
- chthonicdaemon's method: 0.01s
You can cut down unnecessary awk invocations and use BASH to break name-value pairs:
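A minimal sketch of such a loop, assuming GNU sed (for -E and -i). The file names replacement_list and file follow the question; the temporary directory and the sample data are made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch only: the sample pairs and input below are hypothetical.
workdir=$(mktemp -d)
printf '%s\n' 'old|new' '([0-9])U|\10' > "$workdir/replacement_list"
printf '%s\n' 'old 5U' > "$workdir/file"

# IFS='|' makes read split each line into the two variables old and new;
# -r keeps backslashes such as \1 intact.
while IFS='|' read -r old new; do
    # ~ is used as the sed delimiter, so it must not occur in the pairs.
    sed -E -i "s~${old}~${new}~g" "$workdir/file"
done < "$workdir/replacement_list"

result=$(cat "$workdir/file")
printf '%s\n' "$result"
rm -rf "$workdir"
```

This still starts one sed process per pair, but it avoids the two awk calls and the cat per iteration.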
IFS='|' enables read to populate the name and value in two different shell variables, old and new.
This assumes ~ is not present in your name-value pairs. If that is not the case, feel free to use an alternate sed delimiter.
Here is what I would try: store your sed search-replace pairs in a Bash array, then build a single sed command from that array. This results in this command:
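One way this could look, as a sketch rather than the exact command from the answer: each pair becomes one -e expression in an array, and sed is started only once. It assumes GNU sed and that ~ does not occur in the pairs; the array name sed_args and the sample data are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch only: the sample pairs and input below are hypothetical.
workdir=$(mktemp -d)
printf '%s\n' 'old|new' 'tobereplaced|replacement' > "$workdir/replacement_list"
printf '%s\n' 'old tobereplaced' > "$workdir/file"

# Collect one -e expression per pair in a Bash array.
sed_args=()
while IFS='|' read -r old new; do
    sed_args+=(-e "s~${old}~${new}~g")
done < "$workdir/replacement_list"

# A single sed process applies every expression to each line, in list order.
sed -E -i "${sed_args[@]}" "$workdir/file"

result=$(cat "$workdir/file")
printf '%s\n' "$result"
rm -rf "$workdir"
```

With about 100 pairs the argument list stays well within typical command-line limits, and the file is rewritten once instead of once per pair.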
I recently benchmarked various string replacement methods, among them a custom program, sed -e, perl -lnpe and a probably not that widely known MySQL command line utility, replace. replace, being optimized for string replacements, was almost an order of magnitude faster than sed. The results looked something like this (slowest first):
If you want performance, use replace. To have it available on your system, you'll need to install some MySQL distribution, though.
From replace.c:
More on sed: you can utilize multiple cores with sed by splitting your replacements into #cpus groups and then piping them through sed commands, something like this:
Also, if you use sed or perl and your system has a UTF-8 setup, it also boosts performance to place LANG=C in front of the commands:
You might want to do the whole thing in awk:
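A sketch of such an awk script, assuming the pairs are separated by | as in the question; the sample data is made up. Note that gsub treats the old part as a regular expression but does not support \1-style backreferences in the replacement, so this form only covers pairs without groups:

```shell
#!/usr/bin/env bash
# Sketch only: the sample pairs and input below are hypothetical.
workdir=$(mktemp -d)
printf '%s\n' 'old|new' 'tobereplaced|replacement' > "$workdir/replacement_list"
printf '%s\n' 'old tobereplaced' 'nothing here' > "$workdir/file"

# awk reads the list first (NR == FNR), then the file to be edited.
result=$(awk -F'|' '
    # First file: remember each pair, then skip the remaining rules.
    NR == FNR { old[++n] = $1; new[n] = $2; next }
    # Second file: apply every replacement to the current line, in list order.
    { for (i = 1; i <= n; i++) gsub(old[i], new[i]) }
    # A bare 1 is an always-true pattern, so the line is printed.
    1
' "$workdir/replacement_list" "$workdir/file")

printf '%s\n' "$result"
rm -rf "$workdir"
```

Unlike sed -i, this writes the result to standard output, so it would need to be redirected to a new file to replace the original.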
Build up a list of old and new words from the first file. The next ensures that the rest of the script isn't run on the first file. For the second file, loop through the list of replacements and perform each one in turn. The 1 at the end means that the line is printed.
More for fun, here is a version coded in sed. It may be worth timing, because it starts only one sed, which works recursively.
For POSIX sed (so --posix with GNU sed).
Explanation:
³ (and for the list with -End-) for easier sed handling (it is hard to use \n in a character class in POSIX sed).
-End-³, remove the line and go to the final print
(t again)
(t again). T is needed because b does not reset the test and the next t is always true.
You can use sed to produce correctly formatted sed input:
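A sketch of this idea, assuming GNU sed: a first sed turns every old|new line into an s|old|new|g command, and a second sed runs the generated script against the file. The name script.sed and the sample data are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch only: the sample pairs and input below are hypothetical.
workdir=$(mktemp -d)
printf '%s\n' 'old|new' '(stuffiwant).*(too)|\1\2' > "$workdir/replacement_list"
printf '%s\n' 'old stuffiwant and junk too' > "$workdir/file"

# Prefix each line with "s|" and append "|g"; the | already separating
# old and new then serves as the delimiter of the s command.
sed -e 's/^/s|/' -e 's/$/|g/' "$workdir/replacement_list" > "$workdir/script.sed"

# One sed process runs the whole generated script with extended regexes.
sed -E -i -f "$workdir/script.sed" "$workdir/file"

result=$(cat "$workdir/file")
printf '%s\n' "$result"
rm -rf "$workdir"
```

Because | is the delimiter here, the pairs themselves cannot use | for alternation. Groups such as \1 and \2 keep working, which matters for the requirement added in the question's edit.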