I get an input file vendor.csv which has a column called retailer. I have a predefined list of valid retailer values, say a, b and c. If d comes in the retailer column I have to take some action: mostly echo it to a log, stop the processing and notify the user.
I have done the following so far
f1=/stage/Scripts/ecommerce/vendor/final*.csv
k=`cut -d, -f1 $f1 |sort -u`
echo $k
This gives me
a b c d
The above output is not comma-separated.
For the above case I can store the valid values a b c in a file or a string.
How do I make a check now? Is this the best way to go about it?
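For the simple a/b/c case, one way to finish the check is with comm on the sorted unique values. A sketch only: the sample data, the valid.txt file name and the absence of a header row are assumptions.

```shell
# Stand-in for the real vendor file (assumption: no header row,
# retailer in the first comma-separated field)
printf '%s\n' 'a,10' 'b,20' 'c,30' 'd,40' > vendor.csv
printf '%s\n' 'a' 'b' 'c' > valid.txt   # predefined valid values, sorted

# comm -13 keeps only lines present in the second (sorted) input (the data)
# but not the first (the valid list), i.e. the invalid retailers
invalid=$(cut -d, -f1 vendor.csv | sort -u | comm -13 valid.txt -)

if [ -n "$invalid" ]; then
    echo "invalid retailer(s): $invalid" >&2   # a real script would exit 1 here
fi
```

Both inputs to comm must be sorted, which `sort -u` already guarantees for the data side.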
The valid values are "ALB/SFY Total Ecom TA", "Peapod Total Ecom TA" and "Target Total Ecom TA".
The existing data contains the following unique data points
"ALB/SFY Total Ecom TA", "Hy-Vee Total Ecom TA", "Peapod Total Ecom TA" and "Target Total Ecom TA"
So the "Hy-Vee Total Ecom TA" is an invalid value.
Here is my attempt with grep
$ echo $s
ALB/SFY Total Ecom TA Peapod Total Ecom TA Target Total Ecom TA
echo $k
ALB/SFY Total Ecom TA Hy-Vee Total Ecom TA Peapod Total Ecom TA Target Total Ecom TA
grep -v "$s" "$k"
It gave me an error
grep: ALB/SFY Total Ecom TA
Hy-Vee Total Ecom TA
Peapod Total Ecom TA
Target Total Ecom TA: No such file or directory
Some of the solutions have pointed me in the right direction. In R I would go about the above task as
valid_values <- c('a', 'b', 'c')
invalid_retailer <- setdiff(unique(vendorfile$retailer), valid_values)
I was trying to replicate the same process in shell, and hence my usage of cut and grep.
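That setdiff can be approximated in shell with grep. A sketch, where the sample data and the valid.txt file name are assumptions: -F matches literally, -x matches whole lines only, -v inverts the match, and -f reads the valid values from a file.

```shell
printf '%s\n' 'a,10' 'b,20' 'd,40' > vendor.csv   # stand-in data, no header row
printf '%s\n' 'a' 'b' 'c' > valid.txt             # the valid retailer values

# unique(vendorfile$retailer) not in valid_values:
cut -d, -f1 vendor.csv | sort -u | grep -Fxv -f valid.txt
```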
Try the awk command, which handles this more cleanly. We can put all the retailer IDs in a separate file, one per line, say retailer.txt. Then, to print the lines whose first field (separated by ,) matches one of the retailer IDs in retailer.txt, use an awk command that reads retailer.txt into memory first and then filters the CSV against it.

Something like this perhaps?
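A sketch of such a command, using the vendor.csv and valid.txt names from elsewhere in the thread; the sample data and the no-header assumption are illustrative.

```shell
printf '%s\n' 'a,10' 'b,20' 'd,40' > vendor.csv   # stand-in data, no header row
printf '%s\n' 'a' 'b' 'c' > valid.txt             # valid labels, one per line

# First file (NR==FNR): load valid.txt into array a, then skip to next line.
# Second file: print whole lines whose first field is not a valid label.
awk -F, 'NR==FNR { ++a[$1]; next } !($1 in a)' valid.txt vendor.csv
```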
where valid.txt contains your valid labels, one per line.

The general pattern of awk 'NR==FNR { ++a[$1] }' is a common way to read the first of a set of files into an array in memory, and then, in the remainder of the script, perform some sort of join (in the database sense) with fields in the other input files. awk simply processes one line at a time, so the other files can be arbitrarily large, really. You do need to be able to store the data from the first file in memory, though.

The advantage over your basic cut + grep attempt is that we can print the entire input line, rather than just tell you which labels are invalid and have you go back and manually find out which lines in which files actually contained the violation.

Tangentially, your grep attempt has a number of issues. Firstly, if you are dealing with anything more than toy data, you want to avoid storing your data in shell variables. Secondly, you probably want to tweak your options to tell grep that you want to match text literally (-F -- without this, a.c matches abc, because the dot is a regex wildcard character, for example) and that you want matches to cover an entire line (-x -- without this, b matches abc, because it is a substring). The -f filename option says to read the patterns from a file, and without another file name, grep processes standard input (from the pipe, in this case).

Can't grep do what you want? If I understand correctly, calling grep on your CSV file with the right regex can print every line with a wrong retailer. You need to choose a strict regex to prevent false positive matches, but I would need example input to help you... Or, if a regex can't prevent false positives, you can use grep after a cut command, like this:
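A sketch of that pipeline, with illustrative sample data and d as the known bad retailer; -x is added so the match covers the whole field rather than a substring.

```shell
printf '%s\n' 'a,10' 'b,20' 'd,40' > vendor.csv   # stand-in data

# Isolate the retailer column, then flag the bad value
# (-x matches the whole line, so 'd' will not match 'dd')
cut -d, -f1 vendor.csv | grep -x 'd'
```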
with d being the name of the bad retailer. If you want to track more than one bad retailer, you can use grep -E "d|g|h", with d, g and h the names of the bad retailers.
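Put together, the multi-retailer variant might look like this (sample data illustrative; -x again prevents substring matches):

```shell
printf '%s\n' 'a,1' 'd,2' 'g,3' 'h,4' > vendor.csv   # stand-in data

# Flag any of several bad retailers in one pass
cut -d, -f1 vendor.csv | grep -Ex 'd|g|h'
```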