Perform a different regular expression for each co

2019-08-12 20:24发布

I found myself writing PERL for the first time in about 8 years and I am having difficulties with something that should be easy. Here is the basic premise:

A file containing a hundred or so fields 10 of which have incorrect data (the O's are 0's)

A   B   C   D    E  F   ... 
br0wn   red   1278076   0range   "20 tr0ut"   123 ...
Green   0range   90876   Yell0w   "18 Salm0n"   456   ...

I am trying to write the program to split the fields and then allow me to run a regex on field A to replace 0 with O but not replace 0 with O for column C and so on I have the additional problem of needing to possibly run an alternate regex for column E for instance.

I was able to split all the fields in a record by the /t. I am having an issue formatting my command to go over each field and run a specific regex based on the field it is.

Any help would be appreciated and I will Paypal you 10 dollars for a beverage of your choice if you solve it.

6条回答
劳资没心,怎么记你
2楼-- · 2019-08-12 20:35

I'd probably use Perl in 'autosplit' mode:

perl -a -p -F"\t" \
     -e '$F[0] =~ s/0/o/g;
         $F[1] =~ s/0/O/g;
         $F[3] =~ s/0/o/g;
         $F[4] =~ s/(\D)0(\D)/\1o\2/g;  # Or other more complex regex
         # ...                          # Other fields can be edited
         $_ = join("\t", @F);           # Reassign fields to $_
        ' data-file

The regex for $F[4] changes '20 tr0ut' into '20 trout'; you can make it more complex if you need.

Output on sample data:

A       B       C       D       E       F       ...
brown   red     1278076 orange  "20 trout"      123     ...
Green   Orange  90876   Yellow  "18 Salmon"     456     ...

This does assume a strictly tab-separated data file. The quoted strings containing spaces complicate things if you do not have strictly tab-separated data; at that point, Text::CSV is attractive for reading the lines.

查看更多
来,给爷笑一个
3楼-- · 2019-08-12 20:42

Here's one way with a simple configuration using array references and/or subroutines, then the substitutions happening later:

use strict;
use warnings;

my @subst = ([
  ['this', 'that'],
  ['O', 1],
],[
  ['foo', 'boo'],
  sub {s/a.*//},
]);

sub mk_subst {
  my $list = shift;
  my ($this, $that) = eval { @$list };
  return $list unless defined $this;
  sub { s/\Q$this/$that/ };
}

my @all;
for my $set (@subst) {
  my @list = eval { @$set };
  unless (@list) {
    push @all, [ sub {} ];
    next;
  }
  my @re;
  for my $s (@list) {
    push @re, mk_subst($s);
  }
  push @all, \@re;
}

while (<DATA>) {
  chomp;
  my @list = split /\t/, $_, -1;
  for my $i (0..$#list) {
    for ($list[$i]) {
      for my $funcs ($all[$i]) {
        for my $f (@$funcs) {
          $f->();
        }
      }
    }
  }
  print join("\t", @list), "\n";
}

__DATA__
thisO   fooabca1234
abc 123fooabca1234
查看更多
手持菜刀,她持情操
4楼-- · 2019-08-12 20:46

Using a csv parser such as Text::CSV is not complicated. Something like this might suffice:

use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({
        sep_char    => "\t",
        binary      => 1,
        eol         => $/,
});
while (my $row = $csv->getline(*DATA)) {
    tr/0/o/ for @{$row}[0, 1, 3];            # replace in cols A, B and D
    s/(?<!\d)0(?!\d)/o/g for @{$row}[4];     # replace in col E
    $csv->print(*STDOUT, $row);              # print the result
}


__DATA__
A   B   C   D   E   F
br0wn   red 1278076 0range  "20 tr0ut"  123
Green   0range  90876   Yell0w  "18 Salm0n" 456

Output:

A       B       C       D       E       F
brown   red     1278076 orange  "20 trout"      123
Green   orange  90876   Yellow  "18 Salmon"     456

Note that I handled your mixed string (column E) with a simplistic regex instead of transliteration (global replace), and it simply does not replace zeroes which are next to numbers, which will fail for certain numbers, such as 20.0 or 0.

Update:

If you want to do the substitutions based on column names instead of position, things get a bit more complicated. However, Text::CSV can handle it.

use strict;
use warnings;
use Text::CSV;

my @pure_text   = qw(A B D);
my @mixed       = qw(E);

my $csv = Text::CSV->new({
        sep_char    => "\t",
        binary      => 1,
        eol     => $/,
});

my $cols = $csv->getline(*DATA);              # read column names
$csv->print(*STDOUT, $cols);
$csv->column_names($cols);                    # set column names

while (my $row = $csv->getline_hr(*DATA)) {   # hash ref instead of array ref
    tr/0/o/ for @{$row}{@pure_text};          # substitution on hash slice
    s/(?<!\d)0(?!\d)/o/g for @{$row}{@mixed};
    my @row = @{$row}{@$cols};                # make temp array for printing
    $csv->print(*STDOUT, \@row);
}


__DATA__
A   B   C   D   E   F
br0wn   red 1278076 0range  "20 tr0ut"  123
Green   0range  90876   Yell0w  "18 Salm0n" 456

This code is a standalone for demonstration. To try the code on a file, change *DATA to *STDIN and use the script as follows:

perl script.pl < input.csv
查看更多
家丑人穷心不美
5楼-- · 2019-08-12 20:46

Here's one way using GNU awk. Simply add the column names into the array in the BEGIN block. In the example below, only columns A, C and E will be modified. Run like:

awk -f script.awk file

Contents of script.awk:

BEGIN {
    FS=OFS="\t"

    a["A"]
    a["C"]
    a["E"]
}

{
    for (i=1;i<=NF;i++) {

        if ($i in a && NR==1) {
            b[i]
        }

        else if (i in b) {
            $i = gensub(/(^|[^0-9])0([^0-9]|$)/,"\\1o\\2", "g", $i)
        }
    }
}1

Tab separated results:

A   B   C   D   E   F   ... 
brown   red 1278076 0range  "20 trout"  123 ...
Green   0range  90876   Yell0w  "18 Salmon" 456 ...

Alternatively, here's the one-liner:

awk 'BEGIN { FS=OFS="\t"; a["A"]; a["C"]; a["E"] } { for (i=1;i<=NF;i++) { if ($i in a && NR==1) b[i]; else if (i in b) $i = gensub(/(^|[^0-9])0([^0-9]|$)/,"\\1o\\2", "g", $i) } }1' file
查看更多
贪生不怕死
6楼-- · 2019-08-12 20:50

Create an array of subroutines, something like:

my @fixer;
$fixer[0] = sub { $_[0] =~ s/0/o/; };
my @fields = split /\t/, $input;
for (my $i = 0; $i <= $#fields; $i++) {
   $fixer[$i]->($fields[$i]) if defined $fixer[$i];
}
查看更多
forever°为你锁心
7楼-- · 2019-08-12 20:54
perl -F -lane 'for(@F){$_=~s/0/o/g if(/0/ && /[a-zA-Z]+/);} print "@F"' your_file

Tested below

> cat temp
br0wn   red   1278076   0range   "20 tr0ut"   123 ...
Green   0range   90876   Yell0w   "18 Salm0n"   456   ...

> perl -F -lane 'for(@F){$_=~s/0/o/g if(/0/ && /[a-zA-Z]+/);} print "@F"' temp
brown red 1278076 orange "20 trout" 123 ...
Green orange 90876 Yellow "18 Salmon" 456 ...
>
查看更多
登录 后发表回答