Using Perl to compare 2 large files

Posted 2019-09-20 01:53

Question:

I am comparing two large CSV files using a Perl script that is called from a batch file. I write the result to a third file.

Currently the result file also contains extra information, such as headers and lines like these:

--- file1.txt   Wed Mar  7 14:57:10 2018
+++ file2.txt   Wed Mar  7 13:56:51 2018
@@ -85217,4 +85217,8 @@

How can I make the result file contain only the differences? Thank you.

This is my Perl script:

#!/usr/bin/env perl
use strict; use warnings;
use Text::Diff;
my $diffs = diff 'file1.txt' => 'file2.txt';
print $diffs;

This is my batch file:

perl diffperl.pl > newperl.csv

Answer 1:

In the unified format,

  • The first two lines identify the files being compared.
  • Lines that start with "@" indicate the location of the differences in the files.
  • Lines that start with "-" indicate a line that is only in the first file.
  • Lines that start with "+" indicate a line that is only in the second file.
  • Lines that start with a space indicate a line that is in both files.
  • The output may contain the line "\ No newline at end of file".
  • Every line in the diff output will be newline-terminated, even if the lines of the input aren't.

Solution:

$diffs =~ s/^(?:[^\n]*+\n){2}//;
$diffs =~ s/^[\@ \\][^\n]*+\n//mg;

Note that passing CONTEXT => 0 in the options to diff suppresses the unchanged context lines, which reduces the number of lines to remove.
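
To see what the two substitutions do, here is a minimal sketch that applies them to a hard-coded sample of unified output. The sample text and the helper name strip_diff_noise are made up for illustration; in the real script the string would come from diff. It assumes the output starts with the two file header lines, which is the case when diff is given file names.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the "-" and "+" lines of a unified diff.
sub strip_diff_noise {
    my ($diffs) = @_;
    $diffs =~ s/^(?:[^\n]*+\n){2}//;      # Drop the two file header lines.
    $diffs =~ s/^[\@ \\][^\n]*+\n//mg;    # Drop "@@" lines, context lines and "\ No newline..." lines.
    return $diffs;
}

# Made-up sample of what diff might return for two small files.
my $sample = <<'END_DIFF';
--- file1.txt   Wed Mar  7 14:57:10 2018
+++ file2.txt   Wed Mar  7 13:56:51 2018
@@ -1,3 +1,3 @@
 a,1
-b,2
+b,3
 c,4
END_DIFF

print strip_diff_noise($sample);
```

For this sample, only the lines "-b,2" and "+b,3" remain.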


That said, there's not much point in using Text::Diff if you want your own output format. You might as well use Algorithm::Diff directly.

use Algorithm::Diff qw( traverse_sequences );

my $qfn1 = 'file1.txt';
my $qfn2 = 'file2.txt';

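# Read each file into an array of lines.
my @lines1 = do { open(my $fh, '<', $qfn1) or die("Can't open \"$qfn1\": $!\n"); <$fh> };
my @lines2 = do { open(my $fh, '<', $qfn2) or die("Can't open \"$qfn2\": $!\n"); <$fh> };

# Make sure the last line of each file ends with a newline.
if (@lines1) { chomp($lines1[-1]); $lines1[-1] .= "\n"; }
if (@lines2) { chomp($lines2[-1]); $lines2[-1] .= "\n"; }

# Each callback receives the current indexes into both arrays.
traverse_sequences(\@lines1, \@lines2, {
   DISCARD_A => sub { print("-", $lines1[$_[0]]); },   # Line only in the first file.
   DISCARD_B => sub { print("+", $lines2[$_[1]]); },   # Line only in the second file.
});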


Answer 2:

You should look at the STYLE option in the documentation for Text::Diff. It's possible that one of the built-in styles might be more to your liking. But if that's not the case you could write your own formatting package. It sounds to me like you would just need to supply a hunk_header() method that returns an empty string (as it's the hunk header lines that you don't like).
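
A rough sketch of that idea, assuming Text::Diff is installed and that its built-in unified style lives in the class Text::Diff::Unified. The package name BareUnified is hypothetical, and overriding file_header as well is an addition here so that the "---" and "+++" lines disappear too. The inputs are passed as string references only to keep the example self-contained; file names work the same way.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Text::Diff;

{
    # Hypothetical style class: unified output without any header lines.
    package BareUnified;
    our @ISA = ('Text::Diff::Unified');
    sub file_header { return '' }   # No "---" / "+++" lines.
    sub hunk_header { return '' }   # No "@@ ... @@" lines.
}

# Made-up sample data standing in for the two CSV files.
my $old = "a,1\nb,2\nc,4\n";
my $new = "a,1\nb,3\nc,4\n";

# CONTEXT => 0 also suppresses the unchanged lines around each change.
my $diffs = diff(\$old, \$new, { STYLE => 'BareUnified', CONTEXT => 0 });
print $diffs;
```

With this approach no post-processing with regular expressions is needed, because the unwanted lines are never generated.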