Question:
I am trying to do pattern replacement using a sed script, but it's not working properly.
sample_content.txt
288Y2RZDBPX1000000001dhana
JP2F64EI1000000002d
EU9V3IXI1000000003dfg1000000001dfdfds
XATSSSSFOO4dhanaUXIBB7TF71000000004adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN1000000005egw
patterns.txt
1000000001 9000000003
1000000002 2000000001
1000000003 3000000001
1000000004 4000000001
1000000005 5000000001
Expected output
288Y2RZDBPX9000000003dhana
JP2F64EI2000000001d
EU9V3IXI3000000001dfg9000000003dfdfds
XATSSSSFOO4dhanaUXIBB7TF74000000001adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN5000000001egw
I am able to do it with a single sed replacement like:
sed 's/1000000001/1000000003/g' sample_content.txt
Note:
- The matching pattern is not in a fixed position.
- A single line in sample_content.txt may have multiple matching values to replace.
- sample_content.txt and patterns.txt each have more than 1 million records.
File attachment link: https://drive.google.com/open?id=1dVzivKMirEQU3yk9KfPM6iE7tTzVRdt_
Could anyone suggest how to achieve this without hurting performance?
Updated on 11-Feb-2018
After analyzing the real file, I found that there is a grade value at positions 30 and 31, which tells us where the replacements need to be applied.
If the grade is AB, replace the 10-digit phone numbers at 41-50 and 101-110.
If the grade is BC, replace the 10-digit phone numbers at 11-20, 61-70 and 151-160.
If the grade is DE, replace the 10-digit phone numbers at 1-10, 71-80, 151-160 and 181-190.
I am seeing 50 unique grades like this across 2 million sample records.
{ grade = substr($0,110,2) }   # identify the grade
{
    if (grade == "AB") {
        print substr($0,41,10) ORS substr($0,101,10)
    } else if (grade == "BC") {
        print substr($0,11,10) ORS substr($0,61,10) ORS substr($0,151,10)
    }
    # ... and likewise for the remaining grades (about 50 conditions)
}
May I know whether this approach is advisable, or is there a better approach?
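One way to avoid writing 50 if/else branches is to keep the grade-to-columns mapping in a small lookup table and loop over it. The following is only a sketch under stated assumptions: grades.txt and the function name replace_by_grade are hypothetical, the table holds just the three grades listed above, the grade column defaults to 30 (overridable through the hypothetical GPOS variable, since the posted snippet reads it from column 110), and every replacement is assumed to be exactly 10 characters wide so later columns do not shift.

```shell
# Hypothetical lookup table: a grade followed by the 1-based start columns
# of the phone numbers to replace for that grade.
cat > grades.txt <<'EOF'
AB 41 101
BC 11 61 151
DE 1 71 151 181
EOF

# replace_by_grade PATTERNS GRADES CONTENT
# Replaces 10-character values only at the columns listed for the line's grade.
replace_by_grade() {
    awk -v gpos="${GPOS:-30}" '
        FILENAME == ARGV[1] { map[$1] = $2; next }   # old number -> new number
        FILENAME == ARGV[2] { cols[$1] = $0; next }  # grade -> "grade col col ..."
        {
            grade = substr($0, gpos, 2)
            n = (grade in cols) ? split(cols[grade], f, " ") : 0
            for (k = 2; k <= n; k++) {               # f[1] is the grade itself
                old = substr($0, f[k], 10)
                if (old in map)
                    $0 = substr($0, 1, f[k] - 1) map[old] substr($0, f[k] + 10)
            }
            print
        }
    ' "$1" "$2" "$3"
}
```

Each line then costs a handful of hash lookups regardless of how many patterns exist, and adding a grade means adding one line to the table rather than another branch.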
Answer 1:
Benchmarks for future reference
Test environment:
Using your sample files: patterns.txt with 50,000 lines and contents.txt also with 50,000 lines.
All lines from patterns.txt are loaded in all solutions, but only the first 1000 lines of contents.txt are examined.
The testing laptop is equipped with a dual-core 64-bit Intel(R) Celeron(R) CPU N3050 @ 2.16GHz, 4 GB RAM, Debian 9 64-bit Testing, GNU sed 4.4 and GNU awk 4.1.4.
In all cases the output is sent to a new file to avoid the overhead of printing data on the screen.
Results:
1. RavinderSingh13 1st awk solution
$ time awk 'FNR==NR{a[$1]=$2;next} {for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i])}};print}' patterns.txt <(head -n 1000 contents.txt) >newcontents.txt
real 19m54.408s
user 19m44.097s
sys 0m1.981s
2. EdMorton 1st awk Solution
$ time awk 'NR==FNR{map[$1]=$2;next}{for (old in map) {gsub(old,map[old])}print}' patterns.txt <(head -n1000 contents.txt) >newcontents.txt
real 20m3.420s
user 19m16.559s
sys 0m2.325s
3. Sed (my sed) solution
$ time sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(head -n 1000 contents.txt) >newcontents.txt
real 1m1.070s
user 0m59.562s
sys 0m1.443s
4. Cyrus sed solution
$ time sed -f <(sed -E 's|(.*) (.*)|s/\1/\2/|g' patterns.txt) <(head -n1000 contents.txt) >newcontents.txt
real 1m0.506s
user 0m59.871s
sys 0m1.209s
5. RavinderSingh13 2nd awk solution
$ time awk 'FNR==NR{a[$1]=$2;next}{for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i]);print;next}};}1' patterns.txt <(head -n 1000 contents.txt) >newcontents.txt
real 0m25.572s
user 0m25.204s
sys 0m0.040s
For a small amount of input data like 1000 lines, the awk solution seems good.
Let's make another test with 9000 lines this time to compare performance.
6. RavinderSingh13 2nd awk solution with 9000 lines
$ time awk 'FNR==NR{a[$1]=$2;next}{for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i]);print;next}};}1' patterns.txt <(head -9000 contents.txt) >newcontents.txt
real 22m25.222s
user 22m19.567s
sys 0m2.091s
7. Sed Solution with 9000 lines
$ time sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(head -9000 contents.txt) >newcontents.txt
real 9m7.443s
user 9m0.552s
sys 0m2.650s
8. Parallel Seds Solution with 9000 lines
$ cat sedpar.sh
s=$SECONDS
sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(head -3000 contents.txt) >newcontents1.txt &
sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(tail -n +3001 contents.txt |head -3000) >newcontents2.txt &
sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(tail -n +6001 contents.txt |head -3000) >newcontents3.txt &
wait
cat newcontents1.txt newcontents2.txt newcontents3.txt >newcontents.txt && rm -f newcontents1.txt newcontents2.txt newcontents3.txt
echo "seconds elapsed: $(($SECONDS-$s))"
$ time ./sedpar.sh
seconds elapsed: 309
real 5m16.594s
user 9m43.331s
sys 0m4.232s
Splitting the task across several commands, such as three parallel seds, seems to speed things up: about 5m17s of real time versus 9m7s for a single sed on the same 9000 lines.
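The three hard-coded seds can be generalized to any number of chunks. This is a rough sketch rather than a tuned implementation: parallel_sed is a hypothetical helper name, it assumes a non-empty input file, and it takes a sed script file that has been generated once beforehand (for example with printf 's/%s/%s/g\n' $(<patterns.txt) > replace.sed) instead of rebuilding the script for every chunk.

```shell
# parallel_sed SCRIPT INPUT OUTPUT [JOBS]
# Splits INPUT into JOBS chunks of roughly equal size, runs one sed per
# chunk in the background, then concatenates the results in order.
parallel_sed() {
    script=$1 input=$2 output=$3 jobs=${4:-3}
    tmp=$(mktemp -d) || return 1
    lines=$(wc -l < "$input")
    per=$(( (lines + jobs - 1) / jobs ))     # lines per chunk, rounded up
    [ "$per" -gt 0 ] || per=1
    split -l "$per" "$input" "$tmp/chunk."   # creates chunk.aa, chunk.ab, ...
    for c in "$tmp"/chunk.*; do
        sed -f "$script" "$c" > "$c.out" &   # one background sed per chunk
    done
    wait                                     # until every sed has finished
    cat "$tmp"/chunk.*.out > "$output"       # glob order keeps the line order
    rm -rf "$tmp"
}
```

A call such as parallel_sed replace.sed contents.txt newcontents.txt 3 would correspond to the three-way split measured above. Using more jobs than CPU cores is unlikely to help, since each sed is CPU-bound.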
For those who would like to repeat the benchmarks on their own PC, the files contents.txt and patterns.txt can be downloaded either from the OP's link or from my GitHub:
contents.txt
patterns.txt
Answer 2:
Give this one a try; it should be fast.
$ sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) contents.txt
This formats the data of patterns.txt as shown below, without actually changing the real contents of patterns.txt:
$ printf 's/%s/%s/g\n' $(<patterns.txt)
s/1000000001/9000000003/g
s/1000000002/2000000001/g
s/1000000003/3000000001/g
s/1000000004/4000000001/g
s/1000000005/5000000001/g
All of the above is then passed via process substitution <(...) to a plain sed as a script file, using the sed -f switch (read sed commands from a file):
$ sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) contents.txt
288Y2RZDBPX9000000003dhana
JP2F64EI2000000001d
EU9V3IXI3000000001dfg9000000003dfdfds
XATSSSSFOO4dhanaUXIBB7TF74000000001adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN5000000001egw
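The same two steps can also be written out without process substitution, which leaves a reusable script file on disk. A minimal sketch, assuming the patterns contain no slashes or regexp metacharacters (true for the 10-digit numbers in this thread); build_sed_script and replace.sed are hypothetical names:

```shell
# build_sed_script PATTERNS
# Prints one "s/old/new/g" command per "old new" line of PATTERNS.
build_sed_script() {
    awk 'NF >= 2 { printf "s/%s/%s/g\n", $1, $2 }' "$1"
}

# Usage, with the file names used in this thread:
#   build_sed_script patterns.txt > replace.sed
#   sed -f replace.sed contents.txt > newcontents.txt
```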
Answer 3:
Could you please try the following awk and let me know if this helps you.
Solution 1:
awk 'FNR==NR{a[$1]=$2;next} {for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i])}};print}' patterns.txt sample_content.txt
Output will be as follows.
288Y2RZDBPX9000000003dhana
JP2F64EI2000000001d
EU9V3IXI3000000001dfg9000000003dfdfds
XATSSSSFOO4dhanaUXIBB7TF74000000001adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN5000000001egw
Explanation of solution 1:
awk '
FNR==NR{ ##TRUE only while the first Input_file, patterns.txt, is being read.
##FNR and NR both hold the current line number; FNR is RESET whenever a new Input_file starts, while NR keeps increasing until all Input_file(s) are read.
a[$1]=$2; ##Create array a whose index is the 1st field of the line and whose value is the 2nd field.
next ##Skip all further statements for this line.
}
{
for(i in a){ ##Loop over every index of array a.
match($0,i); ##Look for index i (used as a regexp) in the current line; this sets RSTART and RLENGTH.
val=substr($0,RSTART,RLENGTH); ##val holds the matched text, or an empty string when nothing matched.
if(val){ ##If val is NOT empty then do the following:
sub(val,a[i])} ##Replace the first occurrence of val in the line with the value of a[i].
};
print ##Print the current line, changed or not.
}
' patterns.txt sample_content.txt ##The Input_file(s) names.
Solution 2: Instead of traversing the whole array every time as in the first solution, this one leaves the loop as soon as a match is found. Note that this means at most one pattern is replaced per line:
awk '
FNR==NR{ ##TRUE only while the first Input_file, patterns.txt, is being read.
##FNR and NR both hold the current line number; FNR is RESET whenever a new Input_file starts, while NR keeps increasing until all Input_file(s) are read.
a[$1]=$2; ##Create array a whose index is the 1st field of the line and whose value is the 2nd field.
next ##Skip all further statements for this line.
}
{
for(i in a){ ##Loop over every index of array a.
match($0,i); ##Look for index i (used as a regexp) in the current line; this sets RSTART and RLENGTH.
val=substr($0,RSTART,RLENGTH); ##val holds the matched text, or an empty string when nothing matched.
if(val){ ##If val is NOT empty then do the following:
sub(val,a[i]);print;next} ##Replace val with the value of a[i], print the line, and move on to the next line.
};
}
1 ##Print lines in which no pattern matched.
' patterns.txt sample_content.txt ##The Input_file(s) names.
Answer 4:
The simple approach is:
$ cat tst.awk
NR==FNR {
map[$1] = $2
next
}
{
for (old in map) {
gsub(old,map[old])
}
print
}
$ awk -f tst.awk patterns.txt sample_content.txt
288Y2RZDBPX9000000003dhana
JP2F64EI2000000001d
EU9V3IXI3000000001dfg9000000003dfdfds
XATSSSSFOO4dhanaUXIBB7TF74000000001adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN5000000001egw
Just like the other solutions posted so far, this applies every substitution to the whole line, so given sample_content.txt containing xay and patterns.txt including a b and b c, the tools would output xcy rather than xby.
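The chaining effect is easy to reproduce with a tiny illustration (not part of the original benchmarks). sed is used here because it applies its commands in the order given; in awk, the traversal order of for (old in map) is unspecified, so the awk script may or may not hit the chained case on a given run.

```shell
# Two substitutions applied one after the other to the whole line:
# a -> b first, then b -> c, which also rewrites the b just produced.
chained=$(printf 'xay\n' | sed -e 's/a/b/g' -e 's/b/c/g')
echo "$chained"   # prints xcy, not xby
```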
Alternatively you could try this:
$ cat tst.awk
NR==FNR {
map[$1] = $2
re = re sep $1
sep = "|"
next
}
{
head = ""
tail = $0
while ( match(tail,re) ) {
head = head substr(tail,1,RSTART-1) map[substr(tail,RSTART,RLENGTH)]
tail = substr(tail,RSTART+RLENGTH)
}
print head tail
}
$ awk -f tst.awk patterns.txt sample_content.txt
288Y2RZDBPX9000000003dhana
JP2F64EI2000000001d
EU9V3IXI3000000001dfg9000000003dfdfds
XATSSSSFOO4dhanaUXIBB7TF74000000001adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN5000000001egw
That approach has several advantages:
- It would output xby (which is what I suspect you'd really want if that situation arose) in the case I mention above.
- It only does as many regexp comparisons per line of sample_content.txt as could match, instead of 1 per line of patterns.txt for every line of sample_content.txt.
- It only operates on what's left of the line after the previous replacement, so the string being tested keeps shrinking.
- It doesn't change $0, and so awk doesn't have to recompile and resplit that record with every substitution.
So it should be much faster than the original script, assuming the regexp constructed from patterns.txt isn't so huge that it causes a performance degradation just by its size.
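If the combined regexp does turn out to be too large, a different technique is possible when every key has the same length, as the 10-digit numbers here do: slide over the line and test each fixed-width substring with a hash lookup instead of a regexp. This is a sketch of that alternative, not the approach above; replace_fixed_width is a hypothetical name, keys are matched literally, and the key width is assumed to be 10 unless given as a third argument.

```shell
# replace_fixed_width PATTERNS CONTENT [WIDTH]
# Walks each line left to right; wherever the next WIDTH characters are a
# key in the map, emits the replacement and jumps past the key.
replace_fixed_width() {
    awk -v w="${3:-10}" '
        NR == FNR { map[$1] = $2; next }      # old value -> new value
        {
            out = ""
            len = length($0)
            i = 1
            while (i <= len) {
                key = substr($0, i, w)
                if (key in map) {             # literal hash lookup, no regexp
                    out = out map[key]
                    i += w
                } else {                      # copy one character and move on
                    out = out substr($0, i, 1)
                    i++
                }
            }
            print out
        }
    ' "$1" "$2"
}
```

Like the match() loop above, replaced text is never rescanned, so the xay case still yields xby, and the work per line depends on the line length rather than on the number of patterns.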