I have a sequence of .csv files containing columnar data (5 columns) separated by whitespace. Filenames are in the format 'yyyymmdd.csv'. Example file contents:
content of 20161201.csv (columns: key, value, more columns; the files themselves contain no header line):
123456 10000 some value
123457 20000 some value
123458 30000 some value
content of 20161202.csv:
123456 10000 some value
123457 80000 some value
123458 30000 some value
content of 20161203.csv:
123456 50000 some value
123457 70000 some value
123458 30000 some value
I want to compare the file for date 'D' with the file for date 'D+1', based on the value column, and find the pair of consecutive files with the maximum number of differing rows. So in the example above, if I compare 20161201.csv with 20161202.csv, only the 2nd row mismatches
(123457 20000 some value vs 123457 80000 some value, mismatched because 20000 != 80000),
while comparing 20161202.csv with 20161203.csv gives 2 mismatching rows (the 1st and 2nd).
Hence, 20161202.csv and 20161203.csv are my target files here.
I am looking for a sequence of bash commands that can do this.
PS: The number of rows per file is large (around 3000), and you can assume that all files are from the same year and month (number of files < 30).
Without checking whether the filenames respect the date rule (file for date D vs. file for date D+1), you could do something like this:
The core of getting the non-matching lines between two files is grep. You can first run the grep manually to verify that the results are correct:
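A minimal sketch of such a grep pipeline (reconstructed here, since the original command is not shown; filenames are taken from the question):

```shell
# Recreate two of the sample files from the question
printf '123456 10000 some value\n123457 20000 some value\n123458 30000 some value\n' > 20161201.csv
printf '123456 10000 some value\n123457 80000 some value\n123458 30000 some value\n' > 20161202.csv

# Count the lines of file2's value column (field 2) that match no value
# in file1's value column: 1 for this pair (only 80000 differs)
grep -vFw -f <(cut -d' ' -f2 20161201.csv) <(cut -d' ' -f2 20161202.csv) | wc -l
```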
grep options:
-v : print the non-matching lines (inverts the sense of grep)
-F : fixed-string match, not regex
-w : whole-word match, so that 5000 does not match 50000
-f : load the patterns from a file, in this case from field 2 of file1; with these patterns we grep/search field 2 of file2
<(cut -d' ' -f2 file2) : we grep field 2 of file2 rather than the whole file2, to avoid file1's field-2 values accidentally matching in columns of file2 other than column 2
wc -l : counts the remaining lines, i.e. the non-matching lines
Alternative Solution with awk
Instead of grep, you can use awk like this:
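Such an awk command could look like the following (a reconstructed sketch, not the original; filenames from the question):

```shell
# Sample files from the question
printf '123456 10000 some value\n123457 20000 some value\n123458 30000 some value\n' > 20161201.csv
printf '123456 10000 some value\n123457 80000 some value\n123458 30000 some value\n' > 20161202.csv

# While reading file1 (NR==FNR) store field 2 in array a;
# for file2, print the lines whose field 2 is not in a
awk 'NR==FNR{a[$2];next} !($2 in a)' 20161201.csv 20161202.csv
# prints: 123457 80000 some value
```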
This prints the same results as the grep -v above: field 2 of file1 ($2) is loaded into an array a, and the lines of file2 whose field 2 ($2) is not in this array (the non-matching fields) are printed.
The output can likewise be piped to | wc -l to count the non-matching lines, as with grep. So if you prefer awk, the grep line above must be replaced with the awk equivalent.
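Concretely, the replacement would look something like this (a sketch; f1 and f2 are placeholders for the pair of files being compared):

```shell
# Placeholder pair, using the question's sample data
f1=20161201.csv; f2=20161202.csv
printf '123456 10000 some value\n123457 20000 some value\n123458 30000 some value\n' > "$f1"
printf '123456 10000 some value\n123457 80000 some value\n123458 30000 some value\n' > "$f2"

# grep version:
grep -vFw -f <(cut -d' ' -f2 "$f1") <(cut -d' ' -f2 "$f2") | wc -l
# awk version, same count:
awk 'NR==FNR{a[$2];next} !($2 in a)' "$f1" "$f2" | wc -l
```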
In any case, it seems that you need an array to hold the filenames and then a loop to iterate through the files, comparing them in pairs.
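A sketch of such a loop (an assumed implementation using the awk variant; the glob pattern 2016????.csv follows the question's filename format):

```shell
#!/bin/bash
# Sample data from the question
printf '123456 10000 some value\n123457 20000 some value\n123458 30000 some value\n' > 20161201.csv
printf '123456 10000 some value\n123457 80000 some value\n123458 30000 some value\n' > 20161202.csv
printf '123456 50000 some value\n123457 70000 some value\n123458 30000 some value\n' > 20161203.csv

files=( 2016????.csv )          # glob expands in sorted = date order
max=-1
for (( i=0; i<${#files[@]}-1; i++ )); do
    f1=${files[i]} f2=${files[i+1]}
    n=$(awk 'NR==FNR{a[$2];next} !($2 in a)' "$f1" "$f2" | wc -l)
    echo "$f1 vs $f2 : $n differing rows"
    if (( n > max )); then max=$n best="$f1 $f2"; fi
done
echo "target files: $best ($max differing rows)"
```

With the question's sample data this reports 20161202.csv and 20161203.csv as the pair with the most differing rows (2).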
Well, that was kind of a challenge to implement.
With the code below, based purely on awk (GNU awk, actually), all we need is a starting point, i.e. a starting file1. awk will then automatically derive the next file2 (by adding 1 day) and compare those two files for differing lines.
If files are missing from the chain, the script readjusts the names of file1 and file2 so that differing lines are checked between adjacent existing files, while still respecting the +1 day rule.
You should normally be able to run the script even by copy-paste (it works in my bash even with the comments included), or you can save the code in a separate file (e.g. test.awk) and load it with awk's -f switch (awk -f test.awk).
Output:
PS: You can remove all the print parts of the awk script and keep only the final summarizing command.
I hope this code is helpful and works well for you.