I get an input file vendor.csv which has a column called retailer.
I have a predefined list of valid retailer values which are a,b,c. If 'd' comes in the retailer column I will have to take some action , mostly echo it to a log and stop the processing and notify the user.
I have done the following so far
f1=/stage/Scripts/ecommerce/vendor/final*.csv
k=`cut -d, -f1 $f1 |sort -u`
echo $k
This gives me
a b c d
The above o/p is not comma seperated
I can store the valid values a b c in a file or a string , for the above case
How do I make a check now ? Is this the best way to go about this
the valid values are ALB/SFY Total Ecom TA Peapod Total Ecom TA Target Total Ecom TA
The existing data contains the following unique data points
ALB/SFY Total Ecom TA Hy-Vee Total Ecom TA Peapod Total Ecom TA Target Total Ecom TA
So the "Hy-Vee Total Ecom TA" is an invalid value.
Here is my attempt with grep
$ echo $s
ALB/SFY Total Ecom TA Peapod Total Ecom TA Target Total Ecom TA
echo $k
ALB/SFY Total Ecom TA Hy-Vee Total Ecom TA Peapod Total Ecom TA Target Total Ecom TA
grep -v "$s" "$k"
It gave me an error
grep: ALB/SFY Total Ecom TA
Hy-Vee Total Ecom TA
Peapod Total Ecom TA
Target Total Ecom TA: No such file or directory
Some of the solutions have pointed me in the right way, In R I would go about the above task as
valid_values = ['a','b','c']
invalid_retailer = unique(vendorfile$retailer) %not% in valid_values
I was trying to replicate the same process in shell, and hence my usage of cut and grep.
Something like this perhaps?
awk -F, 'NR==FNR { ++a[$1]; next }
!a[$1] { print FILENAME ":" FNR ": Invalid label " $1 >>"/dev/stderr" }' valid.txt final*.csv
where valid.txt
contains your valid labels, one per line.
The general pattern of awk 'NR==FNR { ++a[$1] }'
is a common way to read the first of a set of files into an array in memory and then in the remainder of the script perform some sort of join (in the database sense) with fields in the other input files. Awk simply processes one line at a time so the other files can be arbitrarily large really. You do need to be able to store the data from the first file in memory, though.
The advantage over your basic cut
+grep
attempt is that we can print the entire input line rather than just tell you which labels are invalid and have you go back and manually find out which lines in which files actually contained the violation.
Tangentially, your grep
attempt has a number of issues. Firstly, if you are dealing with anything more than toy data, you want to avoid storing your data in shell variables. Secondly, you probably want to tweak your options to tell grep
that you want to match text literally (-F
-- without this, a.c
matches abc
because the dot is a regex wildcard character, for example) and that you want matches to cover an entire line (-x
-- without this, b
matches abc
because it is a substring).
cut -d, -f1 final*.csv | sort -u |
grep -vxFf valid.txt
The -f
filename option says to read the patterns from a file, and without another file name, grep
processes standard input (from the pipe, in this case).
Try the awk command, which is much refined.
awk -F',' '{if (($1 == "a") || ($1 == "b") || ($1 == "c") || ($1 == "d")) print $0 }' /stage/Scripts/ecommerce/vendor/final*.csv
Other way::
We can give all the retailer id's in seperate file line by line , say in retailer.txt
. The content of retailer.txt
be like
a
b
in order to print those line's first field (seperated by ,) matching with the retailer id's in the retailer.txt
, use the below command:
awk -F',' 'FNR==NR{$1=a[$1];next} ($1 in a)' retailer.txt final*.csv
grep
can't do what you want ?
If I understand, call grep
on your csv file with the good regex can print all line with a wrong retailer.
You need to choose a strong regex to prevent false positive match, but I need input example to help you...
Or, if a regex can't prevent false positive, you can use grep after cut command, like this :
for bad_retailer in $(cut -d, -f1 $f1 | grep d) ; do echo $bad_retailer ; done
with d the name of the bad retailer.
If you want to track more than 1 bad retailer, you can use grep -E "d|g|h"
, with d and g and h names of the bad retailers.