how to trim file - remove the columns with the sam

2020-03-02 04:41发布

I would like your help on trimming a file by removing the columns with the same value.

# the file I have (tab-delimited, millions of columns)
jack 1 5 9
john 3 5 0
lisa 4 5 7

# the file I want (remove the columns with the same value in all lines)
jack 1 9
john 3 0
lisa 4 7

Could you please give me any directions on this problem? I prefer a sed or awk solution, or maybe a perl solution.

Thanks in advance. Best,

标签: perl unix sed awk
8条回答
闹够了就滚
2楼-- · 2020-03-02 05:01

The main problem here is that you said "millions of columns", and did not specify how many rows. In order to check each value in each row against its counterpart in every other column.. you are looking at a great many checks.

Granted, you would be able to reduce the number of columns as you go, but you would still need to check each one down to the last row. So... much processing.

We can make a "seed" hash to start off with from the two first lines:

use strict;
use warnings;

open my $fh, '<', "inputfile.txt" or die;
my %matches;
my $line = <$fh>;
my $nextline = <$fh>;
my $i=0;
while ($line =~ s/\t(\d+)//) {
    my $num1 = $1;
    if ($nextline =~ s/\t(\d+)//) {
       if ($1 == $num1) { $matches{$i} = $num1 }
    } else {
       die "Mismatched line at line $.";
    }
    $i++;
}

Then with this "seed" hash, you could read the rest of the lines, and remove non-matching values from the hash, such as:

while($line = <$fh>) {
    my $i = 0;
    while ($line =~ s/\t(\d+)//) {
        if (defined $matches{$i}) {
            $matches{$i} = undef if ($matches{$i} != $1);
        }
        $i++;
    }
}

One could imagine a solution where one stripped away all the rows which were already proven to be unique, but in order to do that, you need to make an array of the row, or make a regex, and I am not sure that would not take equally long as simply passing through the string.

Then, after processing all the rows, you would have a hash with the values of duplicated numbers, so you could re-open the file, and do your print:

open my $fh, '<', "inputfile.txt" or die;
open my $outfile, '>', "outfile.txt" or die;
while ($line = <$fh>) {
    my $i = 0;
    if ($line =~ s/^([^\t]+)(?=\t)//) {
        print $outfile $1;
    } else { warn "Missing header at line $.\n"; }
    while ($line =~ s/(\t\d+)//) {
        if (defined $matches{$i}) { print $1 }
        $i++;
    }
    print "\n";
}

This is a rather heavy operation, and this code is untested. This will give you a hint to a solution, it will probably take a while to process the whole file. I suggest running some tests to see if it works with your data, and tweak it.

If you only have a few matching columns, it is much easier to simply extract them from the line, but I hesitate to use split on such long lines. Something like:

while ($line = <$fh>) {
    my @line = split /\t/, $line;
    for my $key (sort { $b <=> $a } keys %matches) {
        splice @line, $key + 1, 1;
    }
    $line = join ("\t", @line);
    $line =~ s/\n*$/\n/; # awkward way to make sure to get a single newline
    print $outfile $line;
}

Note that we would have to sort the keys in descending numerical order, so that we trim values from the end. Otherwise we screw up the uniqueness of the subsequent array numbers.

Anyway, it might be one way to go. It's a rather large operation, though. I'd keep backups. ;)

查看更多
爷、活的狠高调
3楼-- · 2020-03-02 05:08
#!/usr/bin/perl
$/="\t";
open(R,"<","/tmp/filename") || die;
while (<R>)
{
  next if (($. % 4) == 3);
  print;
}

Well, this was assuming it was the third column. If it is by value:

#!/usr/bin/perl
$/="\t";
open(R,"<","/tmp/filename") || die;
while (<R>)
{
  next if (($_ == 5);
  print;
}

With the question edit, OP's desires become clear. How about:

#!/usr/bin/perl
open(R,"<","/tmp/filename") || die;
my $first = 1;
my (@cols);
while (<R>)
{
  my (@this) = split(/\t/);
  if ($. == 1)
  {
    @cols = @this;
  }
  else
  {
    for(my $x=0;$x<=$#cols;$x++)
    {
      if (defined($cols[$x]) && !($cols[$x] ~~ $this[$x]))
      {
        $cols[$x] = undef;
      }
    }
  }
  next if (($_ == 5));
#  print;
}
close(R);
my(@del);
print "Deleting columns: ";
for(my $x=0;$x<=$#cols;$x++)
{
  if (defined($cols[$x]))
  {
    print "$x ($cols[$x]), ";
    push(@del,$x-int(@del));
  }
}
print "\n";

open(R,"<","/tmp/filename") || die;
while (<R>)
{
  chomp;
  my (@this) = split(/\t/);

  foreach my $col (@del)
  {
    splice(@this,$col,1);
  }

  print join("\t",@this)."\n";
}
close(R);
查看更多
登录 后发表回答