Compare 4 files line by line to see if they match

Posted 2019-09-20 02:15

Question:

I'm trying to compare 4 text files for counts in each line:

file1.txt:
32
44
75
22
88

file2.txt
32
44
75
22
88

file3.txt
11
44
75
22
77

file4.txt
    32
    44
    75
    22
    88

Each line represents a title:

line1 = customerID count
line2 = employeeID count
line3 = active_users
line4 = inactive_users
line5 = deleted_users

I'm trying to compare file2.txt, file3.txt and file4.txt with file1.txt; file1.txt will always have the correct counts.

Example: since file2.txt matches file1.txt exactly, line by line, in the example above, I'm trying to output "file2.txt is good". Since line1 and line5 of file3.txt do not match file1.txt, I'm trying to output "customerID for file3.txt does not match by 21 records" (i.e. 32 - 11 = 21) and "deleted_users in file3.txt does not match by 11 records" (88 - 77 = 11).

If shell is easier then that is fine too.

Answer 1:

One way to process the files line by line, in parallel:

use warnings;
use strict;
use feature 'say';

my @files = @ARGV;
#my @files = map { $_ . '.txt' } qw(f1 f2 f3 f4);  # my test files' names

# Open all files, filehandles in @fhs
my @fhs = map { open my $fh, '<', $_  or die "Can't open $_: $!"; $fh } @files;

# For reporting, enumerate file names
my %files = map { $_ => $files[$_] } 0..$#files;

# Process (compare) the same line from all files       
my $line_cnt;
LINE: while ( my @line = map { my $line = <$_>; $line } @fhs )
{
    defined || last LINE for @line;
    ++$line_cnt;
    s/(?:^\s+|\s+$)//g for @line;
    for my $i (1..$#line) {
        if ($line[0] != $line[$i]) { 
            say "File $files[$i] differs at line $line_cnt"; 
        }
    }
}

This compares the whole lines numerically (with !=, after leading and trailing spaces are stripped), since it is a given that each line carries a single number that needs to be compared.

With my test files named f1.txt, f2.txt, ..., it prints:

File f3.txt differs at line 1
File f3.txt differs at line 5
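
Since the question says shell is fine too, the same "same line from all files at once" idea can be sketched with paste and awk. This is only a rough sketch: compare_parallel is a hypothetical helper name, the first file is assumed to be the reference, file names are assumed to contain no spaces, and printing the difference at the end of each message is an addition that the Perl code above does not make.

```shell
#!/bin/sh
# compare_parallel REF FILE...  (hypothetical helper, not from the answer above)
# paste joins line N of every file into one tab-separated row; awk then
# compares columns 2..NF against column 1, which holds the reference value.
compare_parallel() {
    paste "$@" | awk -F '\t' -v names="$*" '
        BEGIN { split(names, file, " ") }   # file[i] is the name behind column i
        {
            for (i = 2; i <= NF; i++)
                if ($i + 0 != $1 + 0)       # "+ 0" forces a numeric compare and ignores indentation
                    printf "File %s differs at line %d by %d\n", file[i], NR, $1 - $i
        }'
}

# Demo with the sample data from the question, in a scratch directory
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' 32 44 75 22 88 > file1.txt
printf '%s\n' 32 44 75 22 88 > file2.txt
printf '%s\n' 11 44 75 22 77 > file3.txt
printf '    %s\n' 32 44 75 22 88 > file4.txt
compare_parallel file1.txt file2.txt file3.txt file4.txt
```

With the sample data, this reports line 1 and line 5 of file3.txt, with differences of 21 and 11. If the files have different lengths, paste pads the shorter ones with empty fields, which this sketch treats as 0.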


Answer 2:

Store the line names in one array and the correct values in another. Then loop over the remaining files, read each one line by line, and compare each line to the stored correct value. The special variable $. holds the current line number of the most recently read filehandle, so it can serve as an index into both arrays. Line numbers are 1-based and arrays are 0-based, so we need to subtract 1 to get the correct index.

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

my @line_names = ('customerID count',
                  'employeeID count',
                  'active_users',
                  'inactive_users',
                  'deleted_users');

my @correct;
open my $in, '<', shift or die $!;
while (<$in>) {
    chomp;
    push @correct, $_;
}

while (my $file = shift) {
    open my $in, '<', $file or die $!;
    while (<$in>) {
        chomp;
        if ($_ != $correct[$. - 1]) {
            say "$line_names[$. - 1] in $file does not match by ",
                $correct[$. - 1] - $_, ' records';
        }
    }
}


Answer 3:

Read the first file into an array, then loop over the other files, using the same function to read each of them into an array. Within this loop, go through every line, calculate the difference, and print a message with the text from @names if the difference is not zero.

#!/usr/bin/perl

use strict;
use warnings;

my @names = qw(customerID_count employeeID_count active_users inactive_users deleted_users);
my @files = qw(file1.txt file2.txt file3.txt file4.txt);

my @first = readfile($files[0]);

for (my $i = 1; $i <= $#files; $i++) {
    print "\n$files[0] <=> $files[$i]:\n";
    my @second = readfile($files[$i]);
    for (my $j = 0; $j <= $#names; $j++) {
        my $diff = $first[$j] - $second[$j];
        $diff = -$diff if $diff < 0;
        if ($diff > 0) {
            print "$names[$j] does not match by $diff records\n";
        }
    }
}

sub readfile {
    my ($file) = @_;
    open my $handle, '<', $file or die "Can't open $file: $!";
    chomp(my @lines = <$handle>);
    close $handle;
    s/\s+//g for @lines;    # strip all whitespace, e.g. the indentation in file4.txt
    return @lines;
}

Output is:

file1.txt <=> file2.txt:

file1.txt <=> file3.txt:
customerID_count does not match by 21 records
deleted_users does not match by 11 records

file1.txt <=> file4.txt:


Answer 4:

A mash-up of bash and mostly the GNU versions of standard utilities such as diff, sdiff and sed, plus the ifne utility (from moreutils), and even an eval:

f=("" "customerID count" "employeeID count" \
   "active_users" "inactive_users" "deleted_users")
for n in file{2..4}.txt ; do 
    diff -qws file1.txt $n || 
    $(sdiff file1.txt $n | ifne -n exit | nl | 
      sed -n '/|/{s/[1-5]/${f[&]}/;s/\s*|\s*/-/;s/\([0-9-]*\)$/$((&))/;p}' | 
      xargs printf 'eval echo "%s for '"$n"' does not match by %s records.";\n') ; 
done

Output:

Files file1.txt and file2.txt are identical
Files file1.txt and file3.txt differ
customerID count for file3.txt does not match by 21 records.
deleted_users for file3.txt does not match by 11 records.
Files file1.txt and file4.txt are identical

The same code, tweaked for prettier output:

f=("" "customerID count" "employeeID count" \
   "active_users" "inactive_users" "deleted_users")
for n in file{2..4}.txt ; do 
    diff -qws file1.txt $n || 
    $(sdiff file1.txt $n | ifne -n exit | nl | 
      sed -n '/|/{s/[1-5]/${f[&]}/;s/\s*|\s*/-/;s/\([0-9-]*\)$/$((&))/;p}' | 
      xargs printf 'eval echo "%s does not match by %s records.";\n') ; 
done  | 
sed '/^Files/!s/^/\t/;/^Files/{s/.* and //;s/ are .*/ is good/;s/ differ$/:/}'

Output:

file2.txt is good
file3.txt:
    customerID count does not match by 21 records.
    deleted_users does not match by 11 records.
file4.txt is good
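
For comparison, roughly the same report can probably be produced without eval, ifne or sdiff by letting awk read the reference file first and each candidate file second. This is a sketch under assumptions, not a drop-in replacement: compare_counts is a hypothetical helper name, the labels are the five titles from the question, and each line is assumed to hold a single number.

```shell
#!/bin/sh
# compare_counts REF FILE...  (hypothetical helper, not from the answer above)
compare_counts() {
    ref=$1
    shift
    for f in "$@"; do
        awk -v name="$f" '
            BEGIN {
                # Labels for lines 1..5, as listed in the question
                split("customerID count,employeeID count,active_users,inactive_users,deleted_users", label, ",")
            }
            NR == FNR { want[FNR] = $1 + 0; next }   # first file: remember the reference counts
            {
                diff = want[FNR] - ($1 + 0)          # $1 already ignores leading whitespace
                if (diff < 0) diff = -diff
                if (diff != 0) {
                    if (!bad++) print name ":"       # header line, printed once per file
                    printf "\t%s does not match by %d records.\n", label[FNR], diff
                }
            }
            END { if (!bad) print name " is good" }
        ' "$ref" "$f"
    done
}

# Demo with the sample data from the question, in a scratch directory
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' 32 44 75 22 88 > file1.txt
printf '%s\n' 32 44 75 22 88 > file2.txt
printf '%s\n' 11 44 75 22 77 > file3.txt
printf '    %s\n' 32 44 75 22 88 > file4.txt
compare_counts file1.txt file2.txt file3.txt file4.txt
```

The NR == FNR test is true only while awk is reading its first input file, so the reference counts are stored before any comparison happens. The sketch assumes the reference file is not empty; otherwise that test would also match the second file.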


Answer 5:

Here is an example in Perl:

use feature qw(say);
use strict;
use warnings;

{
    my $ref = read_file('file1.txt');
    my $N = 3;
    my @value_info;
    for my $i (1..$N) {
        my $fn = 'file'.($i+1).'.txt';
        my $values = read_file( $fn );
        push @value_info, [ $fn, $values];
    }
    my @labels = qw(customerID employeeID active_users inactive_users deleted_users);
    for my $info (@value_info) {
        my ( $fn, $values ) = @$info;
        my $all_ok = 1;
        my $j = 0;
        for my $value (@$values) {
            if ( $value != $ref->[$j] ) {
                printf "%s: %s does not match by %d records\n",
                  $fn, $labels[$j], abs( $value - $ref->[$j] );
                $all_ok = 0;
            }
            $j++;
        }
        say "$fn: is good" if $all_ok;
    }
}

sub read_file {
    my ( $fn ) = @_;

    my @values;
    open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
    while( my $line = <$fh>) {
        if ( $line =~ /(\d+)/) {
            push @values, $1;
        }
    }
    close $fh;
    return \@values;
}

Output:

file2.txt: is good
file3.txt: customerID does not match by 21 records
file3.txt: deleted_users does not match by 11 records
file4.txt: is good