I have a file data.txt with a matrix structure (4 x 9):
101000110
000000010
001010010
100101101
I want to count the frequencies of the unique columns; the expected result is:
1001 2
0000 1
1010 1
0001 3
0010 1
1110 1
I could only find "unique lines according to specific columns" solutions using awk on the Internet. Do I need to transpose my data first to solve this problem, or is there a more direct way? Thank you.
This awk will help:
awk '{for (i=1;i<=NF;i++){
          a[i]=a[i]""$i            # append each character to the string for its column
      }
}
END{
    for (i=1;i<=9;i++) {           # 9 = number of columns in the matrix
        res[a[i]]++                # count the occurrences of each column string
    }
    for (r in res){
        print r, res[r]
    }
}' FS= yourfile
Result
1110 1
0000 1
0010 1
0001 3
1010 1
1001 2
Explanation
for (i=1;i<=NF;i++){
    a[i]=a[i]""$i
}
This stores the matrix in a nine-element array keyed by column number; since we know it is a regular matrix, we append each character to the string for its column.
for (i=1;i<=9;i++) {
    res[a[i]]++
}
This uses each assembled column string as a key into an associative array res and counts its occurrences.
for (r in res){
print r, res[r]
}
This just prints the final result.
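As an aside, the 9 in the END loop hardcodes the number of columns. If you'd rather not hardcode it, here is a minimal sketch of the same idea (still relying on GNU awk's empty-FS behavior) that remembers the width of the widest row:
awk '{for (i=1;i<=NF;i++) a[i]=a[i] $i     # append each character to its column
      if (NF>n) n=NF                       # remember the widest row
     }
     END{for (i=1;i<=n;i++) res[a[i]]++
         for (r in res) print r, res[r]
     }' FS= yourfile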
You don't need to transpose it. Use awk with an empty field separator to split each line into characters, and append each value to an array indexed by column number. In the END block, count the frequencies and print them:
awk 'BEGIN{FS=""} {
for (i=1; i<=NF; i++)
a[i] = a[i] $i
}
END {
for (i=1; i<=length(a); i++)
freq[a[i]]++
for(i in freq)
print i, freq[i]
}' file
0000 1
0010 1
0001 3
1001 2
1010 1
1110 1
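One caveat: for (i in freq) visits keys in an unspecified order, which is why the output above is not in the question's order. If you need deterministic output, the same program can simply be piped through sort:
awk 'BEGIN{FS=""}
     {for (i=1; i<=NF; i++) a[i] = a[i] $i}
     END {for (i=1; i<=length(a); i++) freq[a[i]]++
          for (c in freq) print c, freq[c]}' file | sort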
Perl to the rescue:
perl -aF// -lne '$s[$_] .= $F[$_] for 0 .. $#F;
}{
$c{$_}++ for @s;
print "$_\t$c{$_}" for keys %c' < data.txt
- -n reads the input line by line
- -l handles the newlines
- -aF// splits each line into individual characters in the @F array
- }{ ends the implicit -n loop, so the code after it runs once after all input has been read (like awk's END block)
- @s accumulates the characters from particular columns
- At the end, the hash table %c is used to count the frequencies.
Although not needed, here is a transpose-and-count solution using the Unix toolset.
$ sed 's/./&\n/g' file |   # put each character on its own line
  sed '/^$/d' |            # remove the empty line left after each row
  pr -4ts' ' |             # rebuild as 4 space-separated columns (4 = number of rows), i.e. transpose
  tr -d ' ' |              # strip the separators
  sort |                   # group identical columns together
  uniq -c |                # count each unique column
  awk '{print $2,$1}'      # swap to "column count" order
0000 1
0001 3
0010 1
1001 2
1010 1
1110 1
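The 4 in pr -4ts' ' hardcodes the number of rows in the matrix. As a small tweak, you can compute it with wc -l first (assuming each matrix row is one line of file):
$ rows=$(wc -l < file)
$ sed 's/./&\n/g' file | sed '/^$/d' | pr -"$rows"ts' ' | tr -d ' ' | sort | uniq -c | awk '{print $2,$1}'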