I need to compare two huge CSV files, each with thousands of entries like the ones below:
id;val
1;a
2;b
3;c
And the second file has the following structure:
id1;entry
1;002
2;x90
5;d07
The desired result is to match and combine the lines with the same value for id/id1 and create a third CSV file containing only the matched entries, as shown below:
idR;valR;entryR
1;a;002
2;b;x90
To accomplish this I can load each file into a separate database table and run a select like this to retrieve all the matched values:
select tb1.id, tb1.val, tb2.entry
from tb1, tb2
where tb1.id = tb2.id1
With this approach I can retrieve all the desired values at once.
But let's suppose these files can be sorted; in that case it should be possible to use awk to print the results for entries with the same values for id and id1. Is the best I can do to create two associative arrays, one per file, and perform a binary search using awk and sed/cut?
Is it possible to load these two files and combine them in one pass to produce a final CSV file with the results?
Or can I do this in Perl using only the standard library?
You can do this with the standard join utility.
file1.txt
1 a
2 b
3 c
file2.txt
1 002
2 x90
5 d07
join example
join -1 1 -2 1 -o 1.1,1.2,2.2 file1.txt file2.txt
Here join joins on field 1 of file1 (-1 1) and field 1 of file2 (-2 1), and outputs the fields specified with the -o flag.
output
1 a 002
2 b x90
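If the real files are semicolon-delimited with header rows, as in the question, a sketch along these lines adapts the same command (assuming bash for the process substitutions and the hypothetical names file1.csv/file2.csv; join needs both inputs sorted on the join field, hence the sort):

printf 'idR;valR;entryR\n' > result.csv
join -t ';' -1 1 -2 1 -o 1.1,1.2,2.2 \
    <(tail -n +2 file1.csv | sort -t ';' -k1,1) \
    <(tail -n +2 file2.csv | sort -t ';' -k1,1) >> result.csv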
Load the information into memory with awk and then print the line if the id matches:
$ awk 'FNR==NR {a[$1]=$2; next} ($1 in a) {print $1, $2, a[$1]}' f2 f1
1 a 002
2 b x90
Explanation

The basic idea is to do things when reading the first file and other_things when reading the second:

awk 'FNR==NR {things; next} {other_things}' file1 file2

In our case, things is to store the content of file2 in memory, mapping every id to its value. Then, it goes through file1 and prints the content of each line, together with the mapped value, if there is a common id.
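Adapted to the ;-separated files with header rows from the question (hypothetical names file1.csv and file2.csv), the same idea might look like this; the FNR==1 rule skips each file's header and a BEGIN block prints the combined header:

$ awk -F ';' -v OFS=';' '
      BEGIN     { print "idR", "valR", "entryR" }   # combined header row
      FNR == 1  { next }                            # skip each input header
      FNR == NR { a[$1] = $2; next }                # first file: map id1 -> entry
      $1 in a   { print $1, $2, a[$1] }             # second file: print matches
  ' file2.csv file1.csv > result.csv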
Size is the difficult part: to merge files you may need to read the whole lot into memory. However, here is a general solution to the problem in Perl:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;

# First pass over the headers: count how many files each header
# appears in, and remember the order in which headers are first seen.
my %count_of;
my @field_order;
foreach my $file (@ARGV) {
    my $csv = Text::CSV->new( { binary => 1 } );
    open( my $input, "<", $file ) or die $!;
    my $header_row = $csv->getline($input);
    foreach my $header (@$header_row) {
        if ( not $count_of{$header} ) {
            push( @field_order, $header );
        }
        $count_of{$header}++;
    }
}

print "Common headers:\n";
# A header that appears in every input file is a candidate join key.
my @common_headers = grep { $count_of{$_} >= @ARGV } keys %count_of;
print join( "\n", @common_headers ), "\n";

# Second pass: index every row by the shared key field, falling back
# to the line number if the files have no header in common.
my %lookup_row;
my $key_field;
$key_field = shift @common_headers if @common_headers;
foreach my $file (@ARGV) {
    my $csv = Text::CSV->new( { binary => 1 } );
    open( my $input, "<", $file ) or die $!;
    my @headers = @{ $csv->getline($input) };
    $csv->column_names(@headers);
    while ( my $row_hr = $csv->getline_hr($input) ) {
        my $key = $.;    # default key: the current line number
        if ($key_field) {
            $key = $row_hr->{$key_field};
        }
        $lookup_row{$key}{$file} = $row_hr;
    }
    close($input);
}

# Emit the merged header row, then one combined row per key; where
# files disagree on a field, the differing values are concatenated.
my $csv_out = Text::CSV->new( { binary => 1 } );
my $header_row = \@field_order;
$csv_out->print( \*STDOUT, $header_row );
print "\n";

foreach my $key ( sort keys %lookup_row ) {
    my %combined_row;
    foreach my $file ( sort keys %{ $lookup_row{$key} } ) {
        foreach my $header (@field_order) {
            if ( $lookup_row{$key}{$file}{$header} ) {
                if ( not defined $combined_row{$header}
                    or $combined_row{$header} ne
                    $lookup_row{$key}{$file}{$header} )
                {
                    $combined_row{$header}
                        .= $lookup_row{$key}{$file}{$header};
                }
            }
        }
    }
    my @row = @combined_row{@field_order};
    $csv_out->print( \*STDOUT, \@row );
    print "\n";
}
Note that Text::CSV output can be redirected to a file handle rather than STDOUT, which is probably not what you want for large files (or, y'know, just use > output.csv).
You can also configure the delimiter for Text::CSV via sep_char:
my $csv = Text::CSV -> new ( { binary => 1, sep_char => "\t" } );
I was unclear what your separator was, so I have assumed comma (as you refer to csv).
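Given the semicolon-separated samples in the question, the setting you would presumably want instead is:

my $csv = Text::CSV->new( { binary => 1, sep_char => ";" } );

(and the same for $csv_out, so the merged output is semicolon-separated as well).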
The script above will pick out a common field and merge on that, falling back to line number if none exists.
Note:
This script reads the files into memory and merges them there, sorting and joining on a common key, and it sorts on that key for output. It is therefore memory greedy, but should 'just work' in a lot of cases. Just specify the filenames:
splice.pl file1.csv file2.csv file3.csv
If there is a common field in these files, it'll join on that and output in order. If there isn't, it'll use the line number.