I have tab-delimited files with several columns. I want to count how often each distinct value occurs in a given column, across all the files in a folder, and sort the results in decreasing order of count (highest count first). How would I accomplish this in a Linux command-line environment?
Any common command-line language such as awk, Perl or Python is fine.
To see a frequency count for column two (for example):
fileA.txt
fileB.txt
fileC.txt
Result:
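For column two this would be a list of each distinct value with its count, highest count first, along the lines of (illustrative values only):

    3 red
    2 blue
    1 green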
Ruby (1.9+)
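A minimal sketch of a Ruby (1.9+) approach, assuming tab-delimited input and that column two (index 1) is the one of interest; the input files are passed as arguments:

    # count_column.rb (hypothetical name)
    counts = Hash.new(0)                   # default count of 0 for unseen values
    ARGF.each_line do |line|
      value = line.chomp.split("\t")[1]    # column two is index 1
      counts[value] += 1 unless value.nil?
    end
    # print "count<TAB>value", highest count first
    counts.sort_by { |_value, count| -count }.each do |value, count|
      puts "#{count}\t#{value}"
    end

Run it as ruby count_column.rb fileA.txt fileB.txt fileC.txt.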
Perl
This code computes the occurrences of all columns, and prints a sorted report for each of them:
Save the text as columnvalues.pl
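A minimal sketch of what such a script could look like (variable names are illustrative; the structure follows the explanation further down):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @result;    # one hash per column: value => number of occurrences

    while (<>) {
        chomp;
        my @Fields = split /\s+/;
        for my $col (0 .. $#Fields) {
            $result[$col]{ $Fields[$col] }++;
        }
    }

    for my $col (0 .. $#result) {
        print "Column ", $col + 1, ":\n";
        my $counts = $result[$col];
        # primary sort: number of occurrences, descending; secondary sort: the value itself
        for my $value (sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts) {
            print "  $value: $counts->{$value}\n";
        }
    }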
Run it as:
perl columnvalues.pl files*
Explanation
In the top-level while loop:
* Loop over each line of the combined input files
* Split the line into the @Fields array
* For every column, increment the result array-of-hashes data structure
In the top-level for loop:
* Loop over the result array
* Print the column number
* Get the values used in that column
* Sort the values by the number of occurrences
* Secondary sort based on the value (for example b vs g vs m vs z)
* Iterate through the result hash, using the sorted list
* Print the value and number of each occurrence
Results based on the sample input files provided by @Dennis
.csv input
If your input files are .csv, change /\s+/ to /,/.
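In the sketch above, that is the line that splits each input line into fields, e.g.:

    my @Fields = split /,/;    # naive CSV: does not handle quoted commas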
Obfuscation
In an ugliness contest, Perl is particularly well equipped.
This one-liner does the same:
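One possible version, simplified here to count only column two ($F[1], since the autosplit array is 0-indexed) and to print the count and value with the highest count first:

    perl -lane '$c{$F[1]}++ }{ print "$c{$_}\t$_" for sort { $c{$b} <=> $c{$a} } keys %c' fileA.txt fileB.txt fileC.txt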
The GNU site suggests this nice awk script, which prints both the words and their frequency.
Possible changes:
* sort -nr (and reverse word and freq[word]) to see the result in descending order.
* freq[3]++ - replace 3 with the column number.

Here goes:
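A minimal version in that spirit, counting every whitespace-separated field across the input and printing each value with its frequency:

    awk '
    {
        # count every field on every line
        for (i = 1; i <= NF; i++)
            freq[$i]++
    }
    END {
        for (word in freq)
            printf "%s\t%d\n", word, freq[word]
    }' fileA.txt fileB.txt fileC.txt

With the changes above applied (count only one column, print the count first, and pipe through sort -nr), an invocation for column 3 might look like:

    awk '{ freq[$3]++ } END { for (word in freq) printf "%d\t%s\n", freq[word], word }' fileA.txt fileB.txt fileC.txt | sort -nr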
Here is a way to do it in the shell:
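One possibility (a sketch assuming tab-delimited .txt files and column two; cut uses the tab as its default field delimiter):

    # distinct values in column 2 across all files, highest count first
    cut -f2 *.txt | sort | uniq -c | sort -rn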
This is the sort of thing bash is great at.