I have the following script that parses some pipe-delimited field/value pairs. Sample data looks like this: |Apple=32.23|Banana =1232.12|Grape=12312|Pear=231|Grape=1231|
I am just looking to count how many times field names such as A, B, or C appear in the log file. The field list needs to be dynamic. The log files are 'big', about 500 MB each, so it takes a while to sort each one. Is there a faster way to do the count once I run the cut and get a file with one field name per line?
cat /bb/logs/$dir/$file.txt | tr -s "|" "\n" | cut -d "=" -f 1 | sort | uniq -c > /data/logs/$dir/$file.txt.count
I know for a fact that this part runs fast; it is the sort that bogs things down.
cat /bb/logs/$dir/$file.txt | tr -s "|" "\n" | cut -d "=" -f 1
After running the cut, a sample of the output is below; of course, the actual file is much longer:
Apple
Banana
Grape
Pear
Grape
After the sort and count I get:
1 Apple
1 Banana
2 Grape
1 Pear
The problem is that the sort on my actual data takes way too long. I think it would be faster to redirect (>) the output of the cut to a file, but I am not sure of the fastest way to count unique entries in a 'large' text file.
AWK can do it pretty well without sorting. Try something like this; it should perform better:
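A minimal sketch of that idea, keeping the same tr/cut front end from the question but replacing sort | uniq -c with a single awk pass that tallies field names in an associative array (paths are the same placeholders used above):

tr -s '|' '\n' < /bb/logs/$dir/$file.txt | cut -d '=' -f 1 | awk '{ count[$0]++ } END { for (name in count) print count[name], name }' > /data/logs/$dir/$file.txt.count

This streams through the file once and never sorts, so memory use grows with the number of distinct field names rather than the file size. The counts come out in arbitrary order; if you want them ordered, pipe the result through sort afterwards, which is cheap because that output is tiny.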