Subtract corresponding lines

2019-07-11 14:30发布

问题:

I have two files, file1.csv

3 1009
7 1012
2 1013
8 1014

and file2.csv

5 1009
3 1010
1 1013

In the shell, I want to subtract the count in the first column in the second file from that in the first file, based on the identifier in the second column. If an identifier is missing in the second column, the count is assumed to be 0.

The result would be

-2 1009
-3 1010 
7 1012
1 1013
8 1014

The files are huge (several GB). The second columns are sorted.

How would I do this efficiently in the shell?

回答1:

Assuming that both files are sorted on second column:

$ join -j2 -a1 -a2 -oauto -e0 file1 file2 | awk '{print $2 - $3, $1}'
-2 1009
-3 1010
7 1012
1 1013
8 1014

join will join sorted files.
-j2 will join one second column.
-a1 will print records from file1 even it there is no corresponding row in file2.
-a2 Same as -a1 but applied for file2.
-oauto is in this case the same as -o1.2,1.1,2.1 which will print the joined column, and then the remaining columns from file1 and file2.
-e0 will insert 0 instead of an empty column. This works with -a1 and -a2.

The output from join is three columns like:

1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0

Which is piped to awk, to subtract column three from column 2, and then reformatting.



回答2:

$ awk 'NR==FNR { a[$2]=$1; next }
               { a[$2]-=$1 }
           END { for(i in a) print a[i],i }' file1 file2
7 1012
1 1013
8 1014
-2 1009
-3 1010

It reads the first file in memory so you should have enough memory available. If you don't have the memory, I would maybe sort -k2 the files first, then sort -m (merge) them and continue with that output:

$ sort -m -k2 -k3 <(sed 's/$/ 1/' file1|sort -k2) <(sed 's/$/ 2/' file2|sort -k2) # | awk ...
3 1009 1
5 1009 2  # previous $2 = current $2 -> subtract
3 1010 2  # previous $2 =/= current and current $3=2 print -$3
7 1012 1
2 1013 1  # previous $2 =/= current and current $3=1 print prev $2
1 1013 2
8 1014 1

(I'm out of time for now, maybe I'll finish it later)

EDIT by Ed Morton Hope you don't mind me adding what I was working on rather than posting my own extremely similar answer, feel free to modify or delete it:

$ cat tst.awk
{ split(prev,p) }
$2 == p[2] {
    print p[1] - $1, p[2]
    prev = ""
    next
}
p[2] != "" {
    print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
}
{ prev = $0 }
END {
    split(prev,p)
    print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
}

$ sort -m -k2 <(sed 's/$/ 1/' file1) <(sed 's/$/ 2/' file2) | awk -f tst.awk
-2 1009
-3 1010
7 1012
1 1013
8 1014


回答3:

Since the files are sorted¹, you can merge them line-by-line with the join utility in coreutils:

$ join -j2 -o auto -e 0 -a 1 -a 2 41144043-a 41144043-b
1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0

All those options are required:

  • -j2 says to join based on the second column of each file
  • -o auto says to make every row have the same format, beginning with the join key
  • -e 0 says that missing values should be substituted with zero
  • -a 1 and -a 2 include rows that are absent from one file or another
  • the filenames (I've used names based on the question number here)

Now we have a stream of output in that format, we can do the subtraction on each line. I used this GNU sed command to transform the above output into a dc program:

sed -re 's/.*/c&-n[ ]np/e'

This takes the three values on each line and rearranges them into a dc command for the subtraction, then executes it. For example, the first line becomes (with spaces added for clarity)

c 1009 3 5 -n [ ]n p

which subtracts 5 from 3, prints it, then prints a space, then prints 1009 and a newline, giving

-2 1009

as required.

We can then pipe all these lines into dc, giving us the output file that we want:

$ join -o auto -j2 -e 0 -a 1 -a 2 41144043-a 41144043-b \
>   | sed -e 's/.*/c& -n[ ]np/' \
>   | dc
-2 1009
-3 1010
7 1012
1 1013
8 1014

¹ The sorting needs to be consistent with LC_COLLATE locale setting. That's unlikely to be an issue if the fields are always numeric.


TL;DR

The full command is:

join -o auto -j2 -e 0 -a 1 -a 2 "$file1" "$file2" | sed -e 's/.*/c& -n[ ]np/' | dc

It works a line at a time, and starts only the three processes you see, so should be reasonably efficient in both memory and CPU.



回答4:

Assuming this is a csv with blank separation, if this is a "," use argument -F ','

awk 'FNR==NR {Inits[$2]=$1; ids[$2]++; next}
             {Discounts[$2]=$1; ids[$2]++}
     END     { for (id in ids) print Inits[ id] - Discounts[ id] " " id}
    ' file1.csv file2.csv

for memory issue (could be in 1 serie of pipe but prefer to use a temporary file)

awk 'FNR==NR{print;next}{print -1 * $1 " " $2}' file1 file2 \
 | sort -k2 \
 > file.tmp
awk 'Last != $2 { 
        if (NR != 1) print Result " "Last
        Last = $2; Result = $1
        }
    Last == $2 { Result+= $1; next}
    END { print Result " " $2}
    ' file.tmp
rm file.tmp


标签: shell csv awk sed bc