I need to compare two huge CSV files, each with thousands of entries like the ones below:
id;val
1;a
2;b
3;c
And the second file has the following structure:
id1;entry
1;002
2;x90
5;d07
The desired result is to match and combine the lines with the same value for id/id1 and create a third CSV file containing only the matched entries, as shown below:
idR;valR;entryR
1;a;002
2;b;x90
To accomplish this I can load each file into a separate database table and run a select like this to retrieve all the matched values:
select tb1.id, tb1.val, tb2.entry
from tb1, tb2
where tb1.id = tb2.id1
With this approach I can retrieve all the desired values at once.
But let's suppose these files can be sorted; in that case it should be possible to use awk to print the results for entries with the same values for id and id1. Is the best I can do to create two associative arrays, one per file, and perform a binary search using awk and sed/cut?
Is it possible to load these two files and combine them in one pass to produce a final CSV file with the results?
Or can I do this in Perl using only the standard library?
You can do this with the standard join utility.
file1.txt
1 a
2 b
3 c
file2.txt
1 002
2 x90
5 d07
join example
join -1 1 -2 1 -o 1.1,1.2,2.2 file1.txt file2.txt
Here join joins on field 1 of file1 (-1 1) and field 1 of file2 (-2 1), and outputs the fields specified with the -o flag.
output
1 a 002
2 b x90
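If the real files are semicolon-delimited with header rows, as in the question, a sketch along these lines adapts the same command (assuming bash for the process substitutions and the hypothetical names file1.csv/file2.csv; join needs both inputs sorted on the join field, hence the sort):

printf 'idR;valR;entryR\n' > result.csv
join -t ';' -1 1 -2 1 -o 1.1,1.2,2.2 \
    <(tail -n +2 file1.csv | sort -t ';' -k1,1) \
    <(tail -n +2 file2.csv | sort -t ';' -k1,1) >> result.csv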
Load the information into memory with awk and then print the line if the id matches:
$ awk 'FNR==NR {a[$1]=$2; next} ($1 in a) {print $1, $2, a[$1]}' f2 f1
1 a 002
2 b x90
Explanation

The basic idea is to do things when reading the first file and other_things when reading the second:

awk 'FNR==NR {things; next} {other_things}' file1 file2

In our case, things is to store the content of file2 in memory, mapping every id to its value. Then, it goes through file1 and prints the content of each line, together with the mapped value, if there is a common id.
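Adapted to the ;-separated files with header rows from the question (hypothetical names file1.csv and file2.csv), the same idea might look like this; the FNR==1 rule skips each file's header and a BEGIN block prints the combined header:

$ awk -F ';' -v OFS=';' '
      BEGIN     { print "idR", "valR", "entryR" }   # combined header row
      FNR == 1  { next }                            # skip each input header
      FNR == NR { a[$1] = $2; next }                # first file: map id1 -> entry
      $1 in a   { print $1, $2, a[$1] }             # second file: print matches
  ' file2.csv file1.csv > result.csv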
Size is the difficult part: to merge files you may need to read the whole lot into memory. However, here is a general solution to the problem in Perl:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;

# First pass over the headers: count how many files each header
# appears in, and remember the order in which headers are first seen.
my %count_of;
my @field_order;
foreach my $file (@ARGV) {
    my $csv = Text::CSV->new( { binary => 1 } );
    open( my $input, "<", $file ) or die $!;
    my $header_row = $csv->getline($input);
    foreach my $header (@$header_row) {
        if ( not $count_of{$header} ) {
            push( @field_order, $header );
        }
        $count_of{$header}++;
    }
}

print "Common headers:\n";
# A header that appears in every input file is a candidate join key.
my @common_headers = grep { $count_of{$_} >= @ARGV } keys %count_of;
print join( "\n", @common_headers ), "\n";

# Second pass: index every row by the shared key field, falling back
# to the line number if the files have no header in common.
my %lookup_row;
my $key_field;
$key_field = shift @common_headers if @common_headers;
foreach my $file (@ARGV) {
    my $csv = Text::CSV->new( { binary => 1 } );
    open( my $input, "<", $file ) or die $!;
    my @headers = @{ $csv->getline($input) };
    $csv->column_names(@headers);
    while ( my $row_hr = $csv->getline_hr($input) ) {
        my $key = $.;    # default key: the current line number
        if ($key_field) {
            $key = $row_hr->{$key_field};
        }
        $lookup_row{$key}{$file} = $row_hr;
    }
    close($input);
}

# Emit the merged header row, then one combined row per key; where
# files disagree on a field, the differing values are concatenated.
my $csv_out = Text::CSV->new( { binary => 1 } );
my $header_row = \@field_order;
$csv_out->print( \*STDOUT, $header_row );
print "\n";

foreach my $key ( sort keys %lookup_row ) {
    my %combined_row;
    foreach my $file ( sort keys %{ $lookup_row{$key} } ) {
        foreach my $header (@field_order) {
            if ( $lookup_row{$key}{$file}{$header} ) {
                if ( not defined $combined_row{$header}
                    or $combined_row{$header} ne
                    $lookup_row{$key}{$file}{$header} )
                {
                    $combined_row{$header}
                        .= $lookup_row{$key}{$file}{$header};
                }
            }
        }
    }
    my @row = @combined_row{@field_order};
    $csv_out->print( \*STDOUT, \@row );
    print "\n";
}
Note that Text::CSV output can be redirected to a file handle rather than STDOUT, which is probably not what you want for large files (or, y'know, just use > output.csv).
You can also configure the delimiter for Text::CSV via sep_char:
my $csv = Text::CSV -> new ( { binary => 1, sep_char => "\t" } );
I was unclear what your separator was, so I have assumed comma (as you refer to csv).
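Given the semicolon-separated samples in the question, the setting you would presumably want instead is:

my $csv = Text::CSV->new( { binary => 1, sep_char => ";" } );

(and the same for $csv_out, so the merged output is semicolon-separated as well).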
The script above will pick out a common field and merge on that, falling back to line number if none exists.
Note:
This script reads the files into memory and merges them there, sorting and joining on a common key, and it sorts on that key for output. It is therefore memory greedy, but should 'just work' in a lot of cases. Just specify the filenames:
splice.pl file1.csv file2.csv file3.csv
If there is a common field in these files, it'll join on that and output in order. If there isn't, it'll use the line number.