How to get two files having max difference among a

2019-09-18 06:01发布

问题:

I have a sequence of .csv files, which contain columnar (5 columns) data separated by a white space. Filenames are in this format 'yyyymmdd.csv'. File format for example as given below:

content of 20161201.csv

key value more columns (this line (header) is absent)
123456 10000 some value
123457 20000 some value
123458 30000 some value

content of 20161202.csv

key value more columns (this line (header) is absent)
123456 10000 some value
123457 80000 some value
123458 30000 some value

content of 20161203.csv

key value more columns (this line (header) is absent)
123456 50000 some value
123457 70000 some value
123458 30000 some value

I want to compare a file with date 'D' to the file with date 'D+1', based on value column. Then I am interested in those two consecutive files which have maximum number of rows different. So like here if I compare 20161201.csv with 20161202.csv, I get only 2nd row mismatching

(123457 20000 some value and 123457 80000 some value, mismatched because of 20000 != 80000)

then if I compare 20161202.csv with 20161203.csv, I get 2 rows mismatching(1st and 2nd rows)

Hence, 20161202.csv and 20161203.csv are my target files here.

I am looking for a sequence of bash commands which can do the same.

PS: Number of rows in a file are huge(around 3000) and you can assume that all files have same year and month(number of files<30).

回答1:

Without checking if the filenames respect the date comparison rule (data file vs date+1 file), you could do something like this:

while IFS= read -r -d '' fn;do files+=("$fn");done < <(find . -name '201612*.csv' -print0) 
#Load all filenames in an array. Using null separation we ensure that filenames will be  
#handled correctly no matter if they do contain spaces or other special chars.

max=0
for ((i=0;i<"${#files[@]}"-1;i++));do #iterate through the filenames array
  a="${files[i]}";b="${files[i+1]}" #compare file1 with file2, file2 with file3, etc - in series
  differences=$(grep -v -Fw -f <(cut -d' ' -f2 "$a") <(cut -d' ' -f2 "$b") |wc -l)
  echo "comparing $a vs $b - non matching lines=$differences" #Just for testing - can be removed .
  [[ "$max" -lt "$differences" ]] && max="$differences" && ahold="$a" && bhold="$b" #When we have the max differences we keep the names of the files
done

echo "max differences found=$max between $ahold and $bhold" #reporting max differences and in which files found

The core of getting non matching lines between two files is grep. You can try manually the grep to see that the results are correct:

grep -v -F -w -f <(cut -d' ' -f2 file1) <(cut -d' ' -f2 file2) 

grep options:
-v : returns non matched lines (reverse operation of grep)
-F : fixed -not regex - match
-w : word match to avoid 5000 to match with 50000
-f : load patterns from file, in particular from file1, field2.With this pattern we will grep/search field2 of file2.
wc -l : counts the matches = non matched lines <(cut -d' ' -f2 file2) : We grep the field2 of file2 and not the whole file2 to avoid possible matches of file1/field2 in other columns of file2 than column2

Alternative Solution with awk

Instead of grep , you can use an awk like this:

awk 'NR==FNR{a[$2];next}!($2 in a)' file1 file2

This will print the same results as of grep -v

file1/field2($2) will be loaded in an array a
lines of file2/field2 ($2) that are not in this array (non matching fields) will be printed.

Can be also piped to |wc -l to count the non-matching lines , as in grep.

So if you prefer to use awk, this line :

differences=$(grep -v -Fw -f <(cut -d' ' -f2 "$a") <(cut -d' ' -f2 "$b") |wc -l)

must be changed to:

differences=$(awk 'NR==FNR{a[$2];next}!($2 in a)' $a $b |wc -l)

In any case, it seems that you need an array to hold the filenames and then you need a loop to iterate through the files and compare them in pairs.



回答2:

Well, that was a kind of chalenge to implement.

With bellow code , based purely on awk (gnu awk actually), all we need is a starting point / a starting file1. Then awk automatically will get the next file2 (by adding 1 day) and compare those two files for different lines.

If there are missing files from the chain , then script readjust file names of files1 and 2 to check for different lines in adjacent files by respecting the rule of +1 day.

You should normally be able to run the script even with copy paste (works in my bash even with comments included) or you can save the code in a separate file (i.e test.awk) that will be loaded by awk with the -f switch (awk -f test.awk).

awk -v file1="20161201.csv" \
'function incfile(file,days)                                        #function receives two arguments: file and days
    {
    match(file,/(....)(..)(..)/,fn);                                #splits the string of file to format fn[1]=YYYY,fn[2]=MM and fn[3]=DD
    newfile=sprintf("%s%s%02d%s",fn[1],fn[2],fn[3]+days,".csv");    #this function increase the filename by days variable
    return (newfile)                                                #i.e file 20161201.csv returns 20161201+days
    };
BEGIN \
{
    chkdays=1; 
    while (chkdays<=15)
    {
        {
        file2=incfile(file1,1);                                     #Built filename of file2 by increasing file1 +1 day
        if (getline < file2 < 0)                                    #Check if file2 exists
            {
            print file1,"vs",file2,"skipped:",file2 "  not found";  #Print a help message - can be removed
            chkdays=chkdays+2;                                      #increase days counter for the while loop by 2
            file1=incfile(file1,2);                                 #Increase filename of file1 by 2 days (20161201 will be 20161203)
            file2=incfile(file2,2);                                 #The same for filename of file2 (20161202 will be 20161204)
            }
        else                                                        #if file2 exists
            {
            close(file2);                                           
            print "comparing",file1,"vs",file2; 
            while (getline var <file1)                              #read from file1 a line and assign it to var
                {split(var,ff1,OFS);a[ff1[2]]};                     #split line from file 2 (var) to fields, and keep the field2 in an array as index
            while (getline var2 <file2)
                {
                split(var2,ff2,OFS);                                #same for file2.split the line read (var2) 
                if (!(ff2[2] in a)) {print ">",var2;l=l+1};         #check if ff2[2] (file2-field2) is not found on the array created by file1-field2
                }
            if (l>maxd) {maxd=l;maxp=file1 " vs " file2};           #hold/save max different lines found and hold also the files that maxd was found
            file1=file2;                                            #Assign file2 to be file1 in order to repeat the loop
            chkdays=chkdays+1;                                      #Increase check days counter by 1
            delete a;l=0;close(file1);close(file2)                  #unset all necessary vars and close files
            }
        }
    };                                                              #End of BEGIN section
    print "max different lines=",maxd,"found at pair:",maxp         #Print the results
}'                                                                  #Finished

Output:

comparing 20161201.csv vs 20161202.csv
> 123457 80000 some value
comparing 20161202.csv vs 20161203.csv
> 123456 50000 some value
> 123457 70000 some value
20161203.csv vs 20161204.csv skipped: 20161204.csv  not found
20161205.csv vs 20161206.csv skipped: 20161206.csv  not found
20161207.csv vs 20161208.csv skipped: 20161208.csv  not found
20161209.csv vs 20161210.csv skipped: 20161210.csv  not found
comparing 20161211.csv vs 20161212.csv
> 123457 80000 some value
> 123458 15000 some value
> 123458 16000 some value
> 123458 17000 some value
comparing 20161212.csv vs 20161213.csv
> 123456 50000 some value
> 123457 70000 some value
> 123458 20000 some value
> 123458 25000 some value
> 123458 35000 some value
20161213.csv vs 20161214.csv skipped: 20161214.csv  not found
comparing 20161215.csv vs 20161216.csv
max different lines= 5 found at pair: 20161212.csv vs 20161213.csv

$ cat 20161212.csv
123456 10000 some value
123457 80000 some value
123458 30000 some value
123458 15000 some value
123458 16000 some value
123458 17000 some value

$ cat 20161213.csv
123456 50000 some value
123457 70000 some value
123458 20000 some value
123458 15000 some value
123458 25000 some value
123458 35000 some value

# csv files 01,02,03 are copy paste from your OP. file 11 is a copy of file 01.

PS: You can remove all the print parts of awk and keep only the last summarize command.

Hope this code to be helpfull and well working.



标签: linux bash csv