How to diff parts of lines?

Posted 2020-03-24 07:08

Question:

I have two files that I want to diff. The lines have timestamps and possibly some other stuff I would like to ignore for the matching algorithm, but I still want those items output if the matching algorithm finds a difference in the rest of the text. For example:

1c1
<    [junit4] 2013-01-11 04:43:57,392 INFO  com.example.MyClass:123 [main] [loadOverridePropFile] Config file application.properties not found: java.io.FileNotFoundException: /path/to/application.properties (No such file or directory)
---
>    [junit4] 2013-01-11 22:16:07,398 INFO  com.example.MyClass:123 [main] [loadOverridePropFile] Config file application.properties not found: java.io.FileNotFoundException: /path/to/application.properties (No such file or directory)

SHOULD NOT be emitted but:

1c1
<    [junit4] 2013-01-11 04:43:57,392 INFO  com.example.MyClass:123 [main] [loadOverridePropFile] Config file application.properties not found: java.io.FileNotFoundException: /path/to/application.properties (No such file or directory)
---
>    [junit4] 2013-01-11 22:16:07,398 INFO  com.example.MyClass:456 [main] [loadOverridePropFile] Config file application.properties not found: java.io.FileNotFoundException: /path/to/application.properties (No such file or directory)

SHOULD be emitted (since the source line numbers differ: MyClass:123 versus MyClass:456). Note that the timestamps are still emitted.

How can this be done?

Answer 1:

I have wished for this feature a couple of times myself. Since it came up here again, I searched around and found Perl's Algorithm::Diff. You can pass it a hashing function (the documentation calls these "key generation functions") which "should return a string that uniquely identifies a given element". The algorithm then compares these keys instead of the actual content that you feed it.

Basically, all you need to do is add a sub that uses a regex to strip the unwanted parts from each line, and pass a reference to that sub as an extra argument to diff() (see the CHANGE 1 and CHANGE 2 comments in the script below).
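To make the key idea concrete for the log format in the question, here is a minimal sketch in shell rather than Perl. The function name log_key and the shortened sample lines (ending in "started") are hypothetical; only the timestamp layout is taken from the question.

```shell
#!/bin/sh
# Hypothetical key function for the question's log format:
# drop the "YYYY-MM-DD HH:MM:SS,mmm " timestamp and keep everything else.
log_key() {
  sed 's/[0-9]\{4\}-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9]\{3\} //'
}

# Shortened, made-up variants of the lines shown in the question.
key1=$(printf '%s\n' '   [junit4] 2013-01-11 04:43:57,392 INFO  com.example.MyClass:123 [main] started' | log_key)
key2=$(printf '%s\n' '   [junit4] 2013-01-11 22:16:07,398 INFO  com.example.MyClass:123 [main] started' | log_key)
key3=$(printf '%s\n' '   [junit4] 2013-01-11 22:16:07,398 INFO  com.example.MyClass:456 [main] started' | log_key)

# Same text, different timestamp: the keys match, so no difference is reported.
[ "$key1" = "$key2" ] && echo "lines 1 and 2 compare equal"
# Different source line number: the keys differ, so the lines are reported.
[ "$key1" != "$key3" ] && echo "lines 1 and 3 differ"
```

The comparison only ever sees the keys; the original lines, timestamps included, remain available for output.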

If you require normal (or unified) diff output, take the elaborate diffnew.pl example that the module ships with and make the same changes in that file. For demonstration purposes, I will use the simple diff.pl that it also ships with, since it is short enough to post here in full.

mydiff.pl

#!/usr/bin/perl

# based on diff.pl that ships with Algorithm::Diff
# demonstrates the use of a key generation function

# the original diff.pl is:
# Copyright 1998 M-J. Dominus. (mjd-perl-diff@plover.com)
# This program is free software; you can redistribute it and/or modify it
# under the same terms as Perl itself.

use Algorithm::Diff qw(diff);

die("Usage: $0 file1 file2") unless @ARGV == 2;

my ($file1, $file2) = @ARGV;

-f $file1 or die("$file1: not a regular file");
-f $file2 or die("$file2: not a regular file");
-T $file1 or die("$file1: binary file");
-T $file2 or die("$file2: binary file");

open (F1, $file1) or die("Couldn't open $file1: $!");
open (F2, $file2) or die("Couldn't open $file2: $!");
chomp(@f1 = <F1>);
close F1;
chomp(@f2 = <F2>);
close F2;

# CHANGE 1
# $diffs = diff(\@f1, \@f2);
$diffs = diff(\@f1, \@f2, \&keyfunc);

exit 0 unless @$diffs;

foreach $chunk (@$diffs)
{
        foreach $line (@$chunk)
        {
                my ($sign, $lineno, $text) = @$line;
                printf "%4d$sign %s\n", $lineno+1, $text;
        }
}
exit 1;

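# CHANGE 2 {
# Key generation function: returns the line without its leading HH:MM
# timestamp, so that diff() compares only the rest of the text.
sub keyfunc
{
        my $line = shift;
        $line =~ s/^\d{2}:\d{2}\s+//;
        return $line;
}
# }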

one.txt

12:15 one two three
13:21 three four five

two.txt

10:01 one two three
14:38 seven six eight

example run

$ ./mydiff.pl one.txt two.txt
   2- 13:21 three four five
   2+ 14:38 seven six eight

example run 2

And here is a run that produces normal diff output, based on diffnew.pl:

$ ./my_diffnew.pl one.txt two.txt
2c2
< 13:21 three four five
---
> 14:38 seven six eight

As you can see, the first line of each file is not reported: the two lines differ only in their timestamps, and the hashing function removes those before the comparison.

Voilà, you just rolled your own content-aware diff!
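If Perl or Algorithm::Diff is not at hand, the same idea can be approximated with standard tools. The following is only a sketch under assumptions, not part of the module: it diffs stripped copies of the files and then uses the line numbers in the hunk headers of normal diff output to print the original lines. The helper name strip_ts and the temporary file names are hypothetical, and the sample data repeats one.txt and two.txt from above.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf '%s\n' '12:15 one two three' '13:21 three four five' > "$tmp/one.txt"
printf '%s\n' '10:01 one two three' '14:38 seven six eight' > "$tmp/two.txt"

# Key generation: strip a leading HH:MM timestamp (hypothetical helper).
strip_ts() { sed 's/^[0-9][0-9]:[0-9][0-9]  *//' "$1"; }
strip_ts "$tmp/one.txt" > "$tmp/one.key"
strip_ts "$tmp/two.txt" > "$tmp/two.key"

# Diff the keys, then swap each "<" / ">" line for the original line,
# located through the line numbers in the hunk header (e.g. "2c2").
result=$(
  { diff "$tmp/one.key" "$tmp/two.key" || true; } |
  awk -v f1="$tmp/one.txt" -v f2="$tmp/two.txt" '
    BEGIN {
      # Load both original files into arrays indexed by line number.
      while ((getline line < f1) > 0) a[++na] = line
      while ((getline line < f2) > 0) b[++nb] = line
    }
    /^[0-9]/ {
      # Hunk header: first line number on each side of a, c or d.
      split($0, side, "[acd]")
      split(side[1], r, ","); i = r[1] + 0
      split(side[2], r, ","); j = r[1] + 0
      print; next
    }
    /^< / { print "< " a[i++]; next }
    /^> / { print "> " b[j++]; next }
    { print }
  '
)
printf '%s\n' "$result"
rm -rf "$tmp"
```

The sketch handles normal diff output only; context or unified output would need different header parsing.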



Answer 2:

Assuming that your files are "a.txt" and "b.txt", you can do this with diff and cut:

diff <(cut -d" " -f4-99 a.txt) <(cut -d" " -f4-99 b.txt)

Each cut drops the first 3 space-separated fields (the date and related items) and keeps only the rest of the line (fields 4 to 99). An open-ended range should work as well:

cut -d" " -f4- a.txt

But it does not work for me, so I used -f4-99 instead. In short, cut is applied to both inputs to drop the date fields, and diff then compares what is left. Note that the dropped fields do not appear in the diff output either.
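As a quick check of the field numbering, the sketch below runs the same comparison on made-up data. The file contents are hypothetical, not the asker's logs, and temporary files replace the <(...) process substitution so that it also runs in a plain POSIX sh. It uses the open-ended -f4- range, which GNU and BSD cut are expected to accept.

```shell
#!/bin/sh
# Hypothetical sample data: three leading fields (tag, date, time), then the message.
tmp=$(mktemp -d)
printf '%s\n' 'x 2013-01-11 04:43:57 same message' 'x 2013-01-11 04:43:58 old text' > "$tmp/a.txt"
printf '%s\n' 'x 2013-01-11 22:16:07 same message' 'x 2013-01-11 22:16:08 new text' > "$tmp/b.txt"

# Keep field 4 onward from each file.
cut -d' ' -f4- "$tmp/a.txt" > "$tmp/a.cut"
cut -d' ' -f4- "$tmp/b.txt" > "$tmp/b.cut"

# diff exits with status 1 when the inputs differ, hence the "|| true".
result=$(diff "$tmp/a.cut" "$tmp/b.cut" || true)
printf '%s\n' "$result"
rm -rf "$tmp"
```

cut treats every single space as a delimiter, so lines that begin with several spaces or contain runs of spaces shift the field numbers; the field to start from has to be counted on the real data.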



Tags: shell