I am not finding any clear tutorial on this topic. Say I have an input file as:
1 abc
1 def
1 ghi
1 lalala
1 heyhey
2 ahb
2 bbh
3 chch
3 chchch
3 oiohho
3 nonon
3 halal
3 whatever
Say I would like to find the value in column one that appears most often, which here is "3" (it appears 6 times). Then I will need to feed this count (i.e. 6) to another script that goes through the file to do some computations. What are the ways to do this?
Basically, I wonder if it's possible to write a helper function that goes through the file and finds "max", and then call that helper from the main function. Also, I wonder if it's possible to use $(...) within the helper function to call 'awk' or other system commands?
awk 'NR == FNR {nums[$1]++; next} ! flag {flag = 1; for (num in nums) {if (nums[num] > max) {max = nums[num]}}} {print max * $3}' filetomax filetoprocess
Here it is broken out on multiple lines:
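awk '
# First file only: count how often each column-one value appears.
NR == FNR {
    nums[$1]++;
    next
}
# Runs once, on the first line of the second file: find the highest count.
! flag {
    flag = 1;
    for (num in nums) {
        if (nums[num] > max) {
            max = nums[num]
        }
    }
}
# Every line of the second file: max is now available.
{
    print max * $3
}
' filetomax filetoprocess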
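To try the script, here is a minimal sketch that builds both files in a temporary directory. The contents of filetomax are the sample input from the question; the contents of filetoprocess are an assumption made up for illustration, since the script reads a third column that the question's data does not have.

```shell
# Throwaway working directory so nothing is left behind.
workdir=$(mktemp -d)

# Sample input from the question.
cat > "$workdir/filetomax" <<'EOF'
1 abc
1 def
1 ghi
1 lalala
1 heyhey
2 ahb
2 bbh
3 chch
3 chchch
3 oiohho
3 nonon
3 halal
3 whatever
EOF

# Hypothetical second file with a numeric third column.
printf 'a b 1\nc d 2\n' > "$workdir/filetoprocess"

demo_output=$(awk '
NR == FNR { nums[$1]++; next }
! flag {
    flag = 1
    for (num in nums) {
        if (nums[num] > max) {
            max = nums[num]
        }
    }
}
{ print max * $3 }
' "$workdir/filetomax" "$workdir/filetoprocess")

printf '%s\n' "$demo_output"
rm -rf "$workdir"
```

With this data the highest count is 6 (for the value "3"), so the two lines of the second file should print as 6 and 12.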
Here, we're doing the same operation to find the max of the numbers that you've seen before. Instead of using a main block and an END block, we're using a technique that's often used to process one file and then another. The NR == FNR condition is only true while the first file is read because the record number (NR) which is incremented for each line in all the files collectively is equal to the file record number (FNR) which is reset for each new file. In the block associated with this condition, count the times each number appears. The next statement causes execution to loop to read the next line from the files. When the second file is reached, the condition is no longer true and this block will be skipped.
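A quick way to see how NR and FNR behave is to print both for every line of two small files. This is only an illustrative sketch; the file names and contents are made up.

```shell
# Two small throwaway files (hypothetical names and contents).
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/first.txt"
printf 'c\nd\ne\n' > "$tmpdir/second.txt"

# For each line print NR, FNR, and whether NR == FNR holds.
nr_fnr_demo=$(awk '{ print NR, FNR, (NR == FNR ? "first" : "later") }' \
    "$tmpdir/first.txt" "$tmpdir/second.txt")

printf '%s\n' "$nr_fnr_demo"
rm -rf "$tmpdir"
```

The first two lines come out as "1 1 first" and "2 2 first"; from the third line on, FNR restarts at 1 while NR keeps counting, so the condition is false. One caveat: if the first file is empty, NR == FNR is also true for the second file, so the technique assumes the first file has at least one line.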
The next conditional (! flag) checks to see if the contents of the variable are true. Since it hasn't been set, it's false. The exclamation point negates the condition so at this point execution moves into this block. Now the flag is set so the next time the condition is checked, this block will be skipped. The for loop checks to see which number appeared the most times, as in my answer to your other question.
Now, the second file can be processed in any way you like and the variable max is available for use during this processing. I have simply used a print statement to illustrate that. You can still use block selector conditionals, including one or more END blocks as you normally would. I don't show a BEGIN block, but you could add one at the top of this script for any initialization you need. Note that the processing of the first file could have been done in the BEGIN block using getline. That's simply another technique for accomplishing the same thing.
The filenames are listed in the order they are to be processed. The file to find the maximum counts in I've called "filetomax". The second file to do the main processing on I've called "filetoprocess".
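The getline alternative mentioned above could look roughly like the following sketch. The variable name countfile and the sample data are assumptions for illustration; the first file is passed in with -v and read inside BEGIN, so only the second file appears as a regular argument.

```shell
tmpdir=$(mktemp -d)
# Shortened sample data: the value 3 appears three times.
printf '1 abc\n1 def\n2 ahb\n3 chch\n3 chchch\n3 oiohho\n' > "$tmpdir/filetomax"
# Hypothetical second file with a numeric third column.
printf 'x y 2\nx y 5\n' > "$tmpdir/filetoprocess"

getline_result=$(awk -v countfile="$tmpdir/filetomax" '
BEGIN {
    # getline returns 1 for each line read, 0 at end of file, -1 on error.
    while ((getline line < countfile) > 0) {
        split(line, fields)
        nums[fields[1]]++
    }
    close(countfile)
    for (num in nums) {
        if (nums[num] > max) {
            max = nums[num]
        }
    }
}
{
    print max * $3
}
' "$tmpdir/filetoprocess")

printf '%s\n' "$getline_result"
rm -rf "$tmpdir"
```

Here the highest count is 3, so the output should be 6 and 15. Testing the return value with "> 0" matters: a bare while (getline line < countfile) would loop forever if the file cannot be read, because getline returns -1 in that case.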
We use a pipe for this. It takes the stdout of the first process and connects it to the stdin of the second.
awk ... | awk ...
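As for the helper-function part of the question: yes, a shell function can run awk, and $(...) captures whatever the function writes to stdout. Below is a minimal sketch; the function names find_max and main and the second-pass computation are made up for illustration.

```shell
# find_max FILE: print the highest number of times any column-one value occurs.
find_max() {
    awk '
    { nums[$1]++ }
    END {
        for (num in nums) {
            if (nums[num] > max) {
                max = nums[num]
            }
        }
        print max + 0
    }' "$1"
}

# main FILE: get the count from the helper, then make a second pass over FILE.
main() {
    input=$1
    max=$(find_max "$input")    # command substitution captures the helper output
    # Hypothetical computation: append the count to every line.
    awk -v max="$max" '{ print $0, max }' "$input"
}

# Sample input from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1 abc
1 def
1 ghi
1 lalala
1 heyhey
2 ahb
2 bbh
3 chch
3 chchch
3 oiohho
3 nonon
3 halal
3 whatever
EOF

max_count=$(find_max "$sample")
main_output=$(main "$sample")
printf '%s\n' "$main_output"
rm -f "$sample"
```

With the question's data, find_max prints 6 and the first line of the second pass is "1 abc 6". Passing the value with -v keeps the shell variable out of the awk program text, which avoids quoting problems.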