Command line to sum frequency in concatenated file

Posted 2019-08-19 02:24

I need to sum the frequency column across several large tab-separated files. An example of the file content:

Blue    table   3 
Blue    chair   2 
Big cat 1 
Small   cat 2

After concatenating the files, the trouble is the following:

Column 2 is a frequency count: the number of times the combination of Column 0 and Column 1 was seen together.

I need to add up the Column 2 frequencies of all identical Column 0 / Column 1 combinations in the concatenated file.

For instance: If in File A the contents are as follows:

Blue    table   3
Blue    chair   2
Big cat 1
Small   cat 2

and in File B the contents are as follows:

Blue    table   3
Blue    chair   2
Big cat 1
Small   cat 2

the contents in the concatenated File C are as follows:

Blue    table   3
Blue    chair   2
Big cat 1
Small   cat 2
Blue    table   3
Blue    chair   2
Big cat 1
Small   cat 2

I want to sum the frequencies of all identical Column 0 / Column 1 combinations and write them to a File D, to get the following result:

Blue    table   6
Blue    chair   4
Big cat 2
Small   cat 4

I tried to sort and count the info with the following command:

 sort <input_file> | uniq -c > <output_file>

but the result is the following:

  2 Big cat 1
  2 Blue    chair   2
  2 Blue    table   3
  2 Small   cat 2

Does anyone have a suggestion of a terminal command that can produce my desired results?

Thank you in advance for any help.

1 answer
混吃等死
Reply #2 · 2019-08-19 02:43

You're close; you already have all the numbers you need. The total for each row is the number of occurrences reported by uniq -c (column 1 of its output) times the frequency count (column 4). You can calculate that with awk:

sort input.txt | uniq -c | awk '{ print $2 "\t" $3 "\t" $1*$4 }'
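
One caveat: the uniq -c approach only merges lines that are identical in full, so if the same pair shows up with different frequencies in different files (say 3 in File A and 4 in File B), you get two output rows for that pair instead of one. A minimal sketch that sums the third column directly per pair, assuming the files really are tab-separated (sum_freq is just a hypothetical helper name, and the file names in the usage line are placeholders):

```shell
# sum_freq: hypothetical helper; sums field 3 for every distinct (field 1, field 2) pair.
# Assumes tab-separated input. Takes one or more files, so no prior concatenation is needed.
sum_freq() {
    awk -F'\t' -v OFS='\t' '
        { sum[$1 FS $2] += $3 }                        # accumulate frequency per pair
        END { for (key in sum) print key, sum[key] }   # one line per pair, unordered
    ' "$@" | LC_ALL=C sort                             # sort only to get a stable order
}
```

Usage would look like `sum_freq file_a.txt file_b.txt > file_d.txt`. Since awk keeps one running total per distinct pair, it does not need to sort the full input first, which may help with large files.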