Optimize shell script for multiple sed replacement

I have a file containing a list of replacement pairs (about 100 of them) which are used by sed to replace strings in files.

The pairs go like:

old|new
tobereplaced|replacement
(stuffiwant).*(too)|\1\2

and my current code is:

cat replacement_list | while read i
do
    old=$(echo "$i" | awk -F'|' '{print $1}')    #due to the need for extended regex
    new=$(echo "$i" | awk -F'|' '{print $2}')
    sed -r "s/`echo "$old"`/`echo "$new"`/g" -i file
done

I cannot help but think that there is a more optimal way of performing the replacements. I tried turning the loop around to run through lines of the file first but that turned out to be much more expensive.

Are there any other ways of speeding up this script?

EDIT

Thanks for all the quick responses. Let me try out the various suggestions before choosing an answer.

One thing to clear up: I also need subexpressions/groups functionality. For example, one replacement I might need is:

([0-9])U|\10  #the extra brackets and escapes were required for my original code

Some details on the improvements (to be updated):

Method: processing time
Original script: 0.85s
cut instead of awk: 0.71s
anubhava's method: 0.18s
chthonicdaemon's method: 0.01s

Tags: bash shell sed

7 answers

祖国的老花朵

#2 · 2020-02-12 08:31

You can drop the unnecessary awk invocations and let Bash split each pair into its two parts:

while IFS='|' read -r old new; do
   # echo "$old :: $new"
   sed -i "s~$old~$new~g" file
done < replacement_list

Setting IFS='|' lets read split each line at the | and store the two halves in the shell variables old and new.

This assumes ~ does not appear in any of your pairs. If it does, pick a different sed delimiter. Add -r to the sed call if your patterns rely on extended regex, as in the question.


Evening l夕情丶

#3 · 2020-02-12 08:31

Here is what I would try:

store your sed search-replace pairs in a Bash array;
build the sed arguments from this array;
run the command.

patterns=(
  old new
  tobereplaced replacement
)
pattern_count=${#patterns[@]} # number of array elements (two per pair)
sedArgs=() # will hold the list of sed arguments

for (( i=0 ; i<pattern_count ; i=i+2 )); do # step by 2: no need to loop on the replacement…
  search=${patterns[i]}
  replace=${patterns[i+1]} # … here we get the replacement part
  sedArgs+=(-e "s/$search/$replace/g") # append -e and the expression as two array elements
done
sed "${sedArgs[@]}" file

This results in the following command:

sed -e s/old/new/g -e s/tobereplaced/replacement/g file
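
The same idea can be fed from the question's replacement_list instead of a hard-coded array. The following is only a sketch under assumptions: the sample data, the temporary directory and the variable names are made up for illustration, GNU sed is assumed for -r, and ~ is assumed not to occur in any pair.

```shell
#!/usr/bin/env bash
# Hypothetical sample data in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' 'old|new' 'tobereplaced|replacement' > "$workdir/replacement_list"
printf '%s\n' 'old text tobereplaced' > "$workdir/file"

# Collect one "-e s~search~replace~g" argument pair per line of the list.
sed_args=()
while IFS='|' read -r search replace; do
  sed_args+=(-e "s~$search~$replace~g")
done < "$workdir/replacement_list"

# A single sed process applies all substitutions (-r: extended regex, GNU sed).
array_result=$(sed -r "${sed_args[@]}" "$workdir/file")
echo "$array_result"

rm -rf "$workdir"
```

Quoting "${sed_args[@]}" keeps each expression as one argument even if a pattern contains spaces.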


Deceive 欺骗

#4 · 2020-02-12 08:38

I recently benchmarked various string replacement methods, among them a custom program, sed -e, perl -lnpe and a probably not widely known MySQL command-line utility, replace. replace, being optimized for string replacement, was almost an order of magnitude faster than sed. The results looked something like this (slowest first):

custom program > sed > LANG=C sed > perl > LANG=C perl > replace

If you want performance, use replace. To have it available on your system, you'll need to install some MySQL distribution, though.

From replace.c:

Replace strings in textfile

This program replaces strings in files or from stdin to stdout. It accepts a list of from-string/to-string pairs and replaces each occurrence of a from-string with the corresponding to-string. The first occurrence of a found string is matched. If there is more than one possibility for the string to replace, longer matches are preferred before shorter matches.

...

The programs make a DFA-state-machine of the strings and the speed isn't dependent on the count of replace-strings (only of the number of replaces). A line is assumed ending with \n or \0. There are no limit exept memory on length of strings.

More on sed: you can use multiple cores by splitting your replacements into as many groups as you have CPUs and piping the data through one sed command per group, something like this:

$ sed -e 's/A/B/g; ...' file.txt | \
  sed -e 's/B/C/g; ...' | \
  sed -e 's/C/D/g; ...' | \
  sed -e 's/D/E/g; ...' > out
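
As a rough sketch of that split, the snippet below divides a list of sed commands into two groups and runs them as a two-stage pipeline. The file names and sample data are hypothetical. The groups keep the original order of the commands, because later substitutions see the output of earlier ones.

```shell
# Hypothetical sample data in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' 's/A/B/g' 's/B/C/g' 's/C/D/g' 's/D/E/g' > "$workdir/all.sed"
printf '%s\n' 'AAAA' > "$workdir/file.txt"

# Split the commands into two groups, preserving their order.
head -n 2 "$workdir/all.sed" > "$workdir/group1.sed"
tail -n +3 "$workdir/all.sed" > "$workdir/group2.sed"

# Each stage is its own process, so the stages can run on different cores.
pipeline_result=$(sed -f "$workdir/group1.sed" "$workdir/file.txt" | sed -f "$workdir/group2.sed")
echo "$pipeline_result"

rm -rf "$workdir"
```

Whether this is faster than a single sed depends on the file size and the number of replacements, so it is worth timing both variants.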

Also, if you use sed or perl and your system has a UTF-8 locale, putting LANG=C in front of the command boosts performance further:

$ LANG=C sed ...


forever°为你锁心

#5 · 2020-02-12 08:43

You might want to do the whole thing in awk:

awk -F\| 'NR==FNR{old[++n]=$1;new[n]=$2;next}{for(i=1;i<=n;++i)gsub(old[i],new[i])}1' replacement_list file

This builds up a list of old and new words from the first file. The next statement ensures that the rest of the script isn't run on the first file. For each line of the second file, it loops through the list and performs the replacements one by one. The 1 at the end prints the line.
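
Here is a small run of that one-liner on made-up sample data (the temporary files are assumptions for illustration). One caveat: awk's gsub treats the pattern as an extended regex, but its replacement string can only refer to the whole match with &, not to groups such as \1. Pairs like ([0-9])U|\10 from the question would therefore need another tool, or GNU awk's gensub.

```shell
# Hypothetical sample data in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' 'old|new' 'tobereplaced|replacement' > "$workdir/replacement_list"
printf '%s\n' 'old text tobereplaced' > "$workdir/file"

# First file: remember the pairs. Second file: apply every pair to each line.
awk_result=$(awk -F'|' '
  NR == FNR { old[++n] = $1; new[n] = $2; next }
  { for (i = 1; i <= n; ++i) gsub(old[i], new[i]) }
  1' "$workdir/replacement_list" "$workdir/file")
echo "$awk_result"

rm -rf "$workdir"
```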


神经病院院长

#6 · 2020-02-12 08:45

{ cat replacement_list;echo "-End-"; cat YourFile; } | sed -n '1,/-End-/ s/$/³/;1h;1!H;$ {g
t again
:again
   /^-End-³\n/ {s///;b done
      }
   s/^\([^|]*\)|\([^³]*\)³\(\n\)\(.*\)\1/\1|\2³\3\4\2/
   t again
   s/^[^³]*³\n//
   t again
:done
  p
  }'

This was written more for the fun of coding it in sed. It may be worth timing, because it starts only one sed process, which loops over the replacements internally.

It is written for POSIX sed (so use --posix with GNU sed).

Explanation

Put the replacement list in front of the file content, with -End- marking the end of the list and ³ appended to each list line as a line delimiter. This makes the sed handling easier, because \n is hard to use inside a character class in POSIX sed.
Collect all lines in the hold buffer, appending ³ to every line up to and including -End-.
If the buffer starts with -End-³, remove that line and jump to the final print.
Otherwise, replace one occurrence of the first pattern (group 1) in the text with its replacement (group 2).
If a replacement was made, restart (t again).
Otherwise, remove the first line of the list.
Restart the process (t again). t is used here rather than b because b does not reset the substitution flag, which would make the next t always succeed.


Emotional °昔

#7 · 2020-02-12 08:53

You can use sed to produce correctly formatted sed input:

sed -e 's/^/s|/; s/$/|g/' replacement_list | sed -r -f - file
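
To make the two steps visible, here is a sketch that writes the generated script to a file before running it. The sample data and the temporary directory are assumptions for illustration, and -r assumes GNU sed.

```shell
# Hypothetical sample data in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' 'old|new' '([0-9])U|\10' > "$workdir/replacement_list"
printf '%s\n' 'old value 5U' > "$workdir/file"

# Step 1: turn each "pattern|replacement" line into "s|pattern|replacement|g".
sed -e 's/^/s|/; s/$/|g/' "$workdir/replacement_list" > "$workdir/script.sed"
cat "$workdir/script.sed"

# Step 2: one sed process applies every substitution (-r: extended regex).
result=$(sed -r -f "$workdir/script.sed" "$workdir/file")
echo "$result"

rm -rf "$workdir"
```

Because | doubles as the s command delimiter, this only works while neither side of a pair contains a further |, for example as regex alternation.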
