Unique the columns and get the frequencies in Linux

Posted 2019-09-16 20:42

Question:

I have a data.txt with a matrix structure (4 X 9):

101000110
000000010
001010010
100101101

I want to count the frequency of each unique column. The expected result is:

1001 2
0000 1
1010 1
0001 3 
0010 1
1110 1

Searching the Internet, I only found ways to get "unique lines according to specific columns" with awk. Do I need to transpose my data first to solve this problem, or is there a more direct way? Thank you.

Answer 1:

This awk will help:

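awk '{
       # append each character to the string for its column
       for (i=1;i<=NF;i++){
         a[i]=a[i]""$i
       }
     }
     END{
       # 9 is the number of columns in the sample matrix
       for (i=1;i<=9;i++) {
         res[a[i]]++
       }
       for (r in res){
         print r, res[r]
       }
     }' FS= yourfile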

Result

1110 1
0000 1
0010 1
0001 3
1010 1
1001 2

Explanation

for (i=1;i<=NF;i++){
  a[i]=a[i]""$i
}

Builds an array with one entry per column (nine here). Because the matrix is regular, each character is appended to the string for its column position, so after the last line a[i] holds column i read from top to bottom.

 for (i=1;i<=9;i++) {
   res[a[i]]++
   }

Uses each column string as a key in an associative array and counts how many times it occurs.

 for (r in res){
     print r, res[r] 
   }

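Prints each distinct column with its count. The order in which for (r in res) visits the keys is unspecified, so the lines may come out in a different order from the one shown above.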
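
The explanation relies on the matrix being regular and on knowing that it has nine columns. A minimal sketch of the same idea, taking the width from the first line and stopping if a later line has a different width, could look like this. The function name count_columns_checked and the error message are made up for illustration, and the empty FS is assumed to split a line into characters, as it does in gawk and mawk.

```shell
# count_columns_checked is a hypothetical helper name, not from the answer.
count_columns_checked() {
    awk '
    BEGIN { FS = "" }          # assumption: empty FS gives one field per character
    NR == 1 { ncols = NF }     # take the matrix width from the first line
    NF != ncols {
        printf "line %d has %d columns, expected %d\n", NR, NF, ncols > "/dev/stderr"
        bad = 1
        exit 1                 # END still runs, so it checks the flag below
    }
    {
        for (i = 1; i <= NF; i++)
            col[i] = col[i] $i # append this character to column i
    }
    END {
        if (bad) exit 1
        for (i = 1; i <= ncols; i++)
            freq[col[i]]++
        for (key in freq)
            print key, freq[key]
    }' "$@"
}

# Example call with the sample matrix from the question:
printf '%s\n' 101000110 000000010 001010010 100101101 | count_columns_checked
```

For the sample matrix this prints the same six column/count pairs as the answer, in whatever order the awk implementation walks the array.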



Answer 2:

You don't need to transpose it. Use awk with an empty field separator, so that each character becomes a field, and append each field to an array entry indexed by column number. In the END block, count the frequencies and print them:

awk 'BEGIN{FS=""} {
   for (i=1; i<=NF; i++)
      a[i] = a[i] $i
}
END {
   for (i=1; i<=length(a); i++)
      freq[a[i]]++

   for(i in freq)
      print i, freq[i]
}' file

0000 1
0010 1
0001 3
1001 2
1010 1
1110 1
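
Splitting on an empty FS and calling length() on an array are both conveniences that, as far as I know, not every awk is guaranteed to provide. As a hedged alternative, the sketch below picks the characters out with substr() and keeps its own column count. count_columns_portable is a hypothetical name, and the trailing sort is only there to make the output order stable.

```shell
# count_columns_portable is a hypothetical helper name, not from the answer.
count_columns_portable() {
    awk '
    {
        n = length($0)
        if (n > ncols) ncols = n             # remember the widest line seen
        for (i = 1; i <= n; i++)
            col[i] = col[i] substr($0, i, 1) # append the i-th character to column i
    }
    END {
        for (i = 1; i <= ncols; i++)
            freq[col[i]]++
        for (key in freq)
            print key, freq[key]
    }' "$@" | sort
}

# Example call with the sample matrix from the question:
printf '%s\n' 101000110 000000010 001010010 100101101 | count_columns_portable
```

Because of the sort, the sample matrix comes out as 0000 1, 0001 3, 0010 1, 1001 2, 1010 1, 1110 1, one pair per line.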


Answer 3:

Perl to the rescue:

perl -aF// -lne '$s[$_] .= $F[$_] for 0 .. $#F;
                 }{
                 $c{$_}++ for @s;
                 print "$_\t$c{$_}" for keys %c' < data.txt
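  • -n reads the input line by line
  • -l strips the newline from each input line and adds one to every print
  • -aF// splits each line into single characters, stored in the @F array
  • @s accumulates the characters of each column, one string per column index
  • }{ closes the implicit loop that -n wraps around the code, so what follows runs once after the last line
  • At the end, the hash %c is used to count how often each column string occurs.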


Answer 4:

Although it is not needed, here is a transpose-and-count solution using the standard Unix toolset.

$ sed 's/./&\n/g' file | 
  sed '/^$/d'          | 
  pr -4ts' '           | 
  tr -d ' '            | 
  sort                 | 
  uniq -c              | 
  awk '{print $2,$1}'

0000 1
0001 3
0010 1
1001 2
1010 1
1110 1
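
The pipeline above hardcodes the row count (the 4 in pr -4) and relies on sed accepting \n in the replacement, which, as far as I know, not every sed supports. A sketch that counts the rows first and uses fold to put one character per line might look like this. transpose_count is a hypothetical name. It takes a file path because the input is read twice, and it assumes the matrix is small enough for pr to keep on a single page, as the sample is.

```shell
# transpose_count is a hypothetical helper name, not from the answer.
transpose_count() {
    rows=$(awk 'END { print NR }' "$1")  # number of matrix rows
    fold -w 1 "$1" |                     # one character per line
        pr -"$rows" -t -s' ' |           # refill into $rows columns, i.e. transpose
        tr -d ' ' |                      # join the characters of each column
        sort |
        uniq -c |
        awk '{ print $2, $1 }'           # column string first, count second
}
```

Called as transpose_count data.txt, it should give the same sorted list as the pipeline above.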