How can I extract columns from a fixed-width forma

Posted 2019-07-19 15:40

I'm writing a Perl script to run through and grab various data elements such as:

1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000 
1253851200  36.000000      86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000

I can grab each line of this text file no problem.

I have a working regex to grab each of those fields. Once I have the line in a variable (say, $line), how can I grab each of those fields and place them into their own variables, even though they have different delimiters?

6 answers
我欲成王,谁敢阻挡
Reply #2 · 2019-07-19 16:15

You can split the line. It appears that your delimiter is just whitespace? You can do something on the order of:

@line = split(" ", $line);

This splits on any run of whitespace and skips leading whitespace. You can then do bounds checking and access each field via $line[0], $line[1], etc.

split can also take a regular expression rather than a string as the delimiter.

@line = split(/\s+/, $line);

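This does almost the same thing. The one difference is that with /\s+/ a line that starts with whitespace yields an empty first field, whereas split(" ", ...) skips leading whitespace.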
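To see where the two forms differ, here is a minimal sketch. The sample string with leading whitespace is made up for illustration and is not taken from the question's data.

```perl
use strict;
use warnings;

# Hypothetical input with leading whitespace, for illustration only.
my $line = "  1253851200  36.000000  86400";

# Special case: a single-space string skips leading whitespace.
my @awk_style = split ' ', $line;

# A regex delimiter treats the leading whitespace as a separator,
# so the first field comes back empty.
my @regex = split /\s+/, $line;

print scalar(@awk_style), " fields: ", join('|', @awk_style), "\n";
print scalar(@regex),     " fields: ", join('|', @regex),     "\n";
```

The first form returns three fields and the second returns four, with an empty string in position 0.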

家丑人穷心不美
Reply #3 · 2019-07-19 16:18

This example illustrates how to parse the line either with whitespace as the delimiter (split) or with a fixed-column layout (unpack). With unpack, the upper-case A format (A10 etc.) strips trailing whitespace for you; the lower-case a format returns each field as-is, padding included. Note: as brian d foy points out, the split approach does not work well when fields are missing (for example, the second line of data), because the field position information is lost; unpack is the way to go here, unless we are misunderstanding your data.

use strict;
use warnings;

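while (my $line = <DATA>){
    chomp $line;
    # Whitespace-delimited: a missing field shifts the remaining ones left.
    my @fields_whitespace = split /\s+/, $line;
    # Fixed-width: columns of 10, 10, 12 and 28 characters; positions are preserved.
    my @fields_fixed = unpack('a10 a10 a12 a28', $line);
}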

__DATA__
1253592000                                                  
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200 36.000000       86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000
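As a rough sketch of how the unpack result could be placed into separate variables, the snippet below reuses the 10/10/12/28 widths from the template above. The helper name parse_line, the variable names, and the choice to turn blank fields into undef are assumptions for illustration. The sample line is rebuilt with sprintf so the column widths are explicit.

```perl
use strict;
use warnings;

# Widths taken from the unpack template in this answer.
# 'A' strips trailing whitespace only, so leading padding is removed by hand.
my $template = 'A10 A10 A12 A28';

# Hypothetical helper: returns four values, with undef for blank fields.
sub parse_line {
    my ($line) = @_;
    # Pad short lines to the full record width so every field exists.
    my @fields = unpack($template, sprintf('%-60s', $line));
    for (@fields) {
        s/^\s+//;                  # drop leading padding
        $_ = undef if $_ eq '';    # mark a blank field as missing
    }
    return @fields;
}

# A line with the second field missing, laid out in 10/10/12/28 columns.
my $line = sprintf('%-10s%10s%12s%28s', '1253678400', '', '86400', '6183.000000');

my ($timestamp, $field2, $period, $field4) = parse_line($line);
my @by_split = split ' ', $line;

print join('|', map { defined($_) ? $_ : '(missing)' }
    $timestamp, $field2, $period, $field4), "\n";
print "split puts '$by_split[1]' in the second position\n";
```

With unpack, 86400 stays in the third variable and the second is undef; with split, 86400 slides into the second position, which is the loss of position information described above.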
老娘就宠你
Reply #4 · 2019-07-19 16:23

I'm unsure of the column names and formatting, but you should be able to adjust this recipe to your liking using Text::FixedWidth:

use strict;
use warnings;
use Text::FixedWidth;

my $fw = Text::FixedWidth->new;
$fw->set_attributes(
    qw(
        timestamp undef  %10s
        field2    undef  %10s
        period    undef  %12s
        field4    undef  %28s
        )
);

while (<DATA>) {
    $fw->parse( string => $_ );
    print $fw->get_timestamp . "\n";
}

__DATA__
1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200 36.000000       86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000
孤傲高冷的网名
Reply #5 · 2019-07-19 16:38

Use my module, DataExtract::FixedWidth. It is the most full-featured and well-tested module for working with fixed-width columns in Perl. If this isn't fast enough, you can pass in an unpack_string and eliminate the need for heuristic detection of boundaries.

#!/usr/bin/env perl
use strict;
use warnings;
use DataExtract::FixedWidth;
use feature ':5.10';

my @rows = <DATA>;
my $de = DataExtract::FixedWidth->new({
  heuristic => \@rows
  , header_row => undef
});

say join ('|',  @{$de->parse($_)}) for @rows;

Alternatively, if you want header info:

my @rows = <DATA>;
my $de = DataExtract::FixedWidth->new({
  heuristic => \@rows
  , header_row => undef
  , cols => [qw/timestamp field2 period field4/]
});

use Data::Dumper;
warn Dumper $de->parse_hash($_) for @rows;

__DATA__
1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200  36.000000      86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000
叛逆
Reply #6 · 2019-07-19 16:38

If all fields have the same fixed width and are formatted with spaces, you can use the following split:

@array = split / {1,N}/, $line;

where N is the width of the field. This should yield an empty string for each empty field.

我只想做你的唯一
Reply #7 · 2019-07-19 16:38

Fixed width delimiting can be done like this:

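my %header;
$header{field1} = 0;    # char position of first char in field
$header{field2} = 12;
$header{field3} = 15;

while (my $line = <IN>) {
    chomp $line;
    # substr takes (string, offset, length), so the length of field2
    # is the distance from its start to the start of field3.
    print substr($line, $header{field2}, $header{field3} - $header{field2}), "\n";    # value of field2
}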

My Perl is very rusty, so there may be errors in there, but that is the gist of it.
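The same idea can be sketched with a table of offsets and lengths that is walked for every line. The layout below assumes the 10/10/12/28 column widths used in the unpack answer; the field names, the extract_fields helper, and the sprintf-built sample line are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical layout: name, 0-based offset, length.
my @layout = (
    [ timestamp => 0,  10 ],
    [ field2    => 10, 10 ],
    [ period    => 20, 12 ],
    [ field4    => 32, 28 ],
);

# Returns a hash reference mapping field name to trimmed value.
sub extract_fields {
    my ($line) = @_;
    my %record;
    for my $spec (@layout) {
        my ($name, $offset, $length) = @$spec;
        # substr past the end of the string returns undef and warns,
        # so treat fields beyond a short line as empty.
        my $raw = $offset < length($line) ? substr($line, $offset, $length) : '';
        $raw =~ s/^\s+|\s+$//g;    # trim padding on both sides
        $record{$name} = $raw;
    }
    return \%record;
}

# A line with the third field missing, laid out in 10/10/12/28 columns.
my $line = sprintf('%-10s%10s%12s%28s', '1254196800', '0.000000', '', '8828.000000');
my $rec  = extract_fields($line);

print join('|', map { "$_=$rec->{$_}" } qw(timestamp field2 period field4)), "\n";
```

Keeping the layout in one table means a format change only touches the offsets, not the loop.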
