How to count the number of unique values of a field in a tab-delimited text file

Posted 2019-03-08 19:09

I have a text file with a large amount of data, which is tab-delimited. I want to look at the data so that I can see the unique values in a column. For example,

Red     Ball 1 Sold
Blue    Bat  5 OnSale
............... 

So, it's like the first column has colors, and I want to know how many different unique values there are in that column, and I want to be able to do that for each column.

I need to do this on the Linux command line, so probably using a bash script, sed, awk, or something similar.

Addendum: Thanks, everyone, for the help. Can I ask one more thing? What if I wanted a count of these unique values as well?

I guess I didn't put the second part clearly enough. What I want is a count of "each" of these unique values, not just how many unique values there are. For instance, for the first column I want to know how many Red, Blue, Green, etc. colored objects there are.
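To illustrate (example added for clarity, not part of the original question): for the two sample rows above, the desired summary of the first column would be something like

Red     1
Blue    1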

7 Answers
Animai°情兽
#2 · 2019-03-08 19:35

This script outputs the count of each unique value in every column of a given file. It assumes that the first line of the file is a header line. There is no need to define the number of fields. Simply save the script in a bash file (.sh) and provide the tab-delimited file as a parameter to this script (a portability note follows the output example below).

Code

#!/bin/bash

awk '
# Header line: remember each column name.
(NR==1){
    for(fi=1; fi<=NF; fi++)
        fname[fi]=$fi;
}
# Data lines: count every value per column
# (arrays of arrays require GNU awk 4.0+).
(NR!=1){
    for(fi=1; fi<=NF; fi++)
        arr[fname[fi]][$fi]++;
}
# Print one line per column: the column name,
# then each value with its count.
END{
    for(fi=1; fi<=NF; fi++){
        out=fname[fi];
        for (item in arr[fname[fi]])
            out=out"\t"item"_"arr[fname[fi]][item];
        print(out);
    }
}
' "$1"

Execution Example:

bash> ./script.sh <path to tab-delimited file>

Output Example

isRef    A_15      C_42     G_24     T_18
isCar    YEA_10    NO_40    NA_50
isTv     FALSE_33  TRUE_66
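Note (added for portability): the arr[fname[fi]][$fi] arrays-of-arrays syntax above requires GNU awk 4.0 or later. On systems that only have a POSIX awk (mawk, BSD awk), a rough sketch of the same idea using composite SUBSEP keys could look like the following; it also sets the field separator to a tab explicitly, since the question's data is tab-delimited:

#!/bin/bash

awk -F'\t' '
NR == 1 {
    # remember the header names and the number of columns
    for (fi = 1; fi <= NF; fi++)
        fname[fi] = $fi
    ncols = NF
}
NR > 1 {
    # composite key (column index SUBSEP value) works in any POSIX awk
    for (fi = 1; fi <= NF; fi++)
        count[fi, $fi]++
}
END {
    for (fi = 1; fi <= ncols; fi++) {
        out = fname[fi]
        for (key in count) {
            split(key, parts, SUBSEP)
            if (parts[1] == fi)
                out = out "\t" parts[2] "_" count[key]
        }
        print out
    }
}
' "$1"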
家丑人穷心不美
#3 · 2019-03-08 19:39

You can make use of the cut, sort and uniq commands as follows:

cat input_file | cut -f 1 | sort | uniq

This gets the unique values in field 1; replacing 1 with 2 will give you the unique values in field 2.

Avoiding UUOC :)

cut -f 1 input_file | sort | uniq

EDIT:

To count the number of unique values, you can add the wc command to the chain:

cut -f 1 input_file | sort | uniq | wc -l
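For the addendum (a count of each unique value rather than the number of unique values), uniq -c does the per-value counting:

cut -f 1 input_file | sort | uniq -c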
Ridiculous、
#4 · 2019-03-08 19:44
cat test.csv | awk '{ a[$1]++ } END { for (n in a) print n, a[n] } '
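A small variant (not from the original answer): since the question's data is tab-delimited, setting the field separator explicitly keeps values that contain spaces intact, and the cat is not needed:

awk -F'\t' '{ a[$1]++ } END { for (n in a) print n, a[n] }' test.csv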
何必那么认真
#5 · 2019-03-08 19:44

Here is a bash script that fully answers the (revised) original question. That is, given any .tsv file, it provides a synopsis of each of the columns in turn. Apart from bash itself, it only uses standard *ix/Mac tools: sed, tr, wc, cut, sort, and uniq.

#!/bin/bash
# Syntax: $0 filename
# The input is assumed to be a .tsv file

FILE="$1"

# Number of columns = number of tabs in the header line + 1
cols=$(sed -n 1p "$FILE" | tr -cd '\t' | wc -c)
cols=$((cols + 1))

for ((i = 1; i <= cols; i++))
do
  echo "Column $i ::"
  cut -f $i < "$FILE" | sort | uniq -c
  echo
done
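A possible run against the question's sample data, assuming the script is saved as script.sh and the data as data.tsv (both names are illustrative), would print something along these lines:

./script.sh data.tsv

Column 1 ::
      1 Blue
      1 Red

Column 2 ::
      1 Ball
      1 Bat

(and so on for the remaining columns)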
时光不老,我们不散
#6 · 2019-03-08 19:45
# COLUMN is integer column number
# INPUT_FILE is input file name

cut -f ${COLUMN} < ${INPUT_FILE} | sort -u | wc -l
【Aperson】
#7 · 2019-03-08 19:55

You can use awk, sort, and uniq to do this. For example, to list all the unique values in the first column:

awk < test.txt '{print $1}' | sort | uniq

As posted elsewhere, if you want to count the number of unique values, you can pipe the unique list into wc -l.
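Spelled out with the same test.txt, counting the unique values in the first column would be:

awk < test.txt '{print $1}' | sort | uniq | wc -l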
