Validating unique values of a column in shell

Published 2019-08-18 03:48

Question:

I get an input file vendor.csv which has a column called retailer. I have a predefined list of valid retailer values: a, b, c. If 'd' appears in the retailer column, I have to take some action: echo it to a log, stop the processing, and notify the user.

I have done the following so far

f1=/stage/Scripts/ecommerce/vendor/final*.csv
k=`cut -d, -f1 $f1 |sort -u`
echo $k

This gives me

a b c d

The above output is not comma-separated.
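As an aside, the newline-separated list from sort can be joined with commas by piping through paste; a minimal sketch using the same path as above:

```shell
# Join the unique values of column 1 with commas instead of newlines.
f1=/stage/Scripts/ecommerce/vendor/final*.csv
cut -d, -f1 $f1 | sort -u | paste -sd, -
```

For input values a, b, c, d this prints a,b,c,d on a single line.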

For the above case, I can store the valid values a b c in a file or a string.

How do I make the check now? Is this the best way to go about it?

The valid values are: ALB/SFY Total Ecom TA, Peapod Total Ecom TA, Target Total Ecom TA.

The existing data contains the following unique data points: ALB/SFY Total Ecom TA, Hy-Vee Total Ecom TA, Peapod Total Ecom TA, Target Total Ecom TA.

So "Hy-Vee Total Ecom TA" is an invalid value.

Here is my attempt with grep:

$ echo $s
ALB/SFY Total Ecom TA Peapod Total Ecom TA Target Total Ecom TA

$ echo $k
ALB/SFY Total Ecom TA Hy-Vee Total Ecom TA Peapod Total Ecom TA Target Total Ecom TA

grep -v "$s" "$k"

It gave me an error:

grep: ALB/SFY Total Ecom TA
Hy-Vee Total Ecom TA
Peapod Total Ecom TA
Target Total Ecom TA: No such file or directory

Some of the solutions have pointed me in the right direction. In R, I would go about the above task as:

valid_values <- c('a', 'b', 'c')
invalid_retailer <- setdiff(unique(vendorfile$retailer), valid_values)

I was trying to replicate the same process in shell, and hence my usage of cut and grep.
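One way to replicate that setdiff in shell, sketched here under the assumption that the valid values live one per line in a hypothetical valid.txt, is comm, which compares two sorted files:

```shell
# Both inputs to comm must be sorted.
sort -u valid.txt > valid.sorted
cut -d, -f1 final*.csv | sort -u > seen.txt
# -13 suppresses lines unique to valid.sorted and lines common to both,
# leaving only values seen in the data but absent from the valid list.
comm -13 valid.sorted seen.txt
```

With valid values a, b, c and seen values a, b, c, d, this prints just d.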

Answer 1:

Something like this perhaps?

awk -F, 'NR==FNR { ++a[$1]; next }
    !a[$1] { print FILENAME ":" FNR ": Invalid label " $1 >>"/dev/stderr" }' valid.txt final*.csv

where valid.txt contains your valid labels, one per line.

The general pattern awk 'NR==FNR { ++a[$1] }' is a common way to read the first of a set of files into an in-memory array and then, in the remainder of the script, perform a join (in the database sense) against fields of the other input files. Awk processes one line at a time, so the other files can be arbitrarily large; you do need to be able to hold the data from the first file in memory, though.
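A tiny self-contained run of that pattern, with made-up sample data, might look like this:

```shell
# Hypothetical inputs just to show the join in action.
printf 'a\nb\nc\n' > valid.txt
printf 'a,100\nd,200\nb,300\n' > final1.csv

awk -F, 'NR==FNR { ++a[$1]; next }
    !a[$1] { print FILENAME ":" FNR ": Invalid label " $1 >>"/dev/stderr" }' valid.txt final1.csv
# stderr: final1.csv:2: Invalid label d
```

Only the row whose first field is missing from valid.txt is reported, with its file name and line number.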

The advantage over your basic cut+grep attempt is that we can print the entire input line rather than just tell you which labels are invalid and have you go back and manually find out which lines in which files actually contained the violation.

Tangentially, your grep attempt has a number of issues. Firstly, if you are dealing with anything more than toy data, you want to avoid storing your data in shell variables. Secondly, you probably want to tweak your options to tell grep that you want to match text literally (-F -- without this, a.c matches abc because the dot is a regex wildcard character, for example) and that you want matches to cover an entire line (-x -- without this, b matches abc because it is a substring).

cut -d, -f1 final*.csv | sort -u |
grep -vxFf valid.txt

The -f filename option says to read the patterns from a file, and without another file name, grep processes standard input (from the pipe, in this case).
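Since the question also asks to stop processing and notify the user, the pipeline's exit status can drive that; a sketch, assuming a valid.txt as described above:

```shell
# grep exits 0 when it finds lines, i.e. when invalid values exist.
if bad=$(cut -d, -f1 final*.csv | sort -u | grep -vxFf valid.txt); then
    echo "Invalid retailer value(s):" >&2
    echo "$bad" >&2
    exit 1    # stop here; a mail or alert step could replace this
fi
```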



Answer 2:

Try awk, which is better suited to this.

awk -F',' '($1 == "a") || ($1 == "b") || ($1 == "c") { print $0 }' /stage/Scripts/ecommerce/vendor/final*.csv

Another way:

We can put all the valid retailer ids in a separate file, one per line, say retailer.txt. The content of retailer.txt would be like:

a
b

To print the lines whose first field (comma-separated) matches a retailer id in retailer.txt, use the command below:

awk -F',' 'FNR==NR { a[$1]; next } ($1 in a)' retailer.txt final*.csv
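To flag the violations rather than the valid rows (an assumption about what the question ultimately wants), negate the membership test:

```shell
# Print rows whose first field is NOT listed in retailer.txt.
awk -F',' 'FNR==NR { a[$1]; next } !($1 in a)' retailer.txt final*.csv
```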


Answer 3:

Can't grep do what you want?

If I understand correctly, calling grep on your csv file with the right regex can print every line with a wrong retailer. You need to choose a strict regex to prevent false-positive matches, but I would need example input to help further.

Or, if a regex can't prevent false positives, you can use grep after the cut command, like this:

for bad_retailer in $(cut -d, -f1 "$f1" | grep d) ; do echo "$bad_retailer" ; done

where d is the name of the bad retailer.

If you want to track more than one bad retailer, you can use grep -E "d|g|h", where d, g, and h are the names of the bad retailers.
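Since the real retailer names here contain spaces and characters special to regexes (like the . in a hypothetical name), a safer variant passes each bad name as a literal (-F) whole-line (-x) pattern:

```shell
# Match whole lines literally; add one -e per bad retailer name.
cut -d, -f1 final*.csv | sort -u | grep -xF -e 'd' -e 'g' -e 'h'
```

Each -e pattern is compared against the entire line, so a name that is a substring of another never matches by accident.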