I have a large tab-separated data table with thousands of rows and dozens of columns and it has missing data marked as "na". For example,
na 0.93 na 0 na 0.51
1 1 na 1 na 1
1 1 na 0.97 na 1
0.92 1 na 1 0.01 0.34
I would like to calculate the mean of each column, but making sure that the missing data are ignored in the calculation. For example, the mean of column 1 should be 0.97. I believe I could use awk
but I am not sure how to construct the command to do this for all columns and account for missing data.
All I know how to do is to calculate the mean of a single column but it treats the missing data as 0 rather than leaving it out of the calculation.
awk '{sum+=$1} END {print sum/NR}' filename
This is obscure, but works for your example
EDIT: Here is how it works:
```
A possible solution:
The output for the given data:
Note that the third column contains only "na" and the output is
0
. If you want the output to bena
, then change theEND{...}
-block to:END{for(i=1; i<= NF; i++){line=line""(denom[i] ? sum[i]/denom[i]:"na")FS} print line}'