Perl multiline regex for first 3 individual items

Posted 2019-09-18 14:41

Question:

I am trying to parse a record format with a regex in Perl. Usually a record is on a single line, but sometimes the same record is spread over 3 lines.

For the single-line format below, I can use the regex

/^\s*(.*)\s+([a-zA-Z0-9._]+)\s+(\d+)\s+(.*)/

to get the first 3 individual items in the line:

Hi There       FirstName.LastName    10  3/23/2011 2:46 PM

Below is the multi-line format I see. I am trying to use something like

/^\s*(.*)\n*\n*|\s+([a-zA-Z0-9._]+)\s+(\d+)\s+(.*)$/m

to get the individual items, but it doesn't seem to work.

Hi There    

                         FirstName-LastName       8       7/17/2015 1:15 PM 

Testing - 12323232323 Hello There

Any suggestions? Is multi-line regex possible?

NOTE: In the same output I can see single-line records, multi-line records, or both, so the output can look like this:

Hello Line1 FirstName.LastName 10 3/23/2011 2:46 PM

Hello Line2

                         Line2FirstName-LastName       8       7/17/2015 1:15 PM 

Testing - 12323232323 Hello There

Hello Line3 Line3FirstName.LastName 8 3/21/2011 2:46 PM

Answer 1:

You can certainly apply a regex across multiple lines.

I've used the negated word class \W+ between words to match the spaces and newlines that separate them (\W is equivalent to [^a-zA-Z0-9_] for ASCII text). The chat text is treated as a repeated \w+\W+ block.
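To make the point about \W+ concrete, here is a minimal sketch. The sample string and the variable names are my own, chosen only for illustration. Because \W is a character class, it matches newlines without needing the /s or /m modifiers.

```perl
use strict;
use warnings;

# Illustrative sample: two words, then a blank line and an indented third word.
my $text = "Hi There    \n\n      FirstName-LastName";

# Each \W+ consumes whatever separates two words: spaces, tabs, or newlines.
my @words = $text =~ /(\w+)\W+(\w+)\W+(\w+)/;

print "word $_ -> $words[$_ - 1]\n" for 1 .. @words;
```

The second \W+ here spans the trailing spaces, both newlines, and the indentation in one go.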

If you provide a more specific input/output case, I can refine the example code:

#!/usr/bin/env perl
use strict;
use warnings;

my $input = <<'__END__';
Hi There    

                         FirstName-LastName       8       7/17/2015 1:15 PM 

Testing - 12323232323 Hello There
__END__

# In list context the match returns the four capture groups in order.
my ($chat,$username,$chars,$timestamp) = $input =~ m/(?im)^\s*((?:\w+\W+)+)(\w+[-,\.]\w+)\W+(\d+)\W+([0-1]?\d\/[0-3]?\d\/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m)/
    or die "input did not match\n";

$chat =~ s/\s+$//;  # remove trailing whitespace

print "chat -> ${chat}\n";
print "username -> ${username}\n";
print "chars -> ${chars}\n";
print "timestamp -> ${timestamp}\n";

Legend

  • m/^.../: a match (not a substitution) anchored at the start of a line
  • (?im): case-insensitive matching plus multi-line mode (^ and $ also match at the start and end of each line)
  • \s*: zero or more whitespace characters (spaces, tabs, line breaks, form feeds)
  • ((?:\w+\W+)+): (match group $chat) one or more repetitions of a word \w+ (letters, digits, '_') followed by non-word characters \W+ (anything that is not \w, including the newline \n). Trailing whitespace is stripped from the result afterwards.
  • (\w+[-,\.]\w+): (match group $username) this is the weak point. If the username is not two regex words separated by a dash '-', a comma ',' or (UPDATE) a dot '.', the whole regex fails. I took the dash and dot forms from the examples in your question; the format is not stated explicitly.
  • (\d+): (match group $chars) a number made of one or more digits
  • ([0-1]?\d\/[0-3]?\d\/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m): (match group $timestamp) this one is longer than the others, so split it up:
    • [0-1]?\d\/[0-3]?\d\/[1-2]\d{3} matches a date: a month (with an optional leading zero), a day (with an optional leading zero) and a year from 1000 to 2999 (a relaxed constraint :)
    • [0-2]?\d:[0-5]?\d\s?[ap]m matches the time: hour:minutes, an optional space, and 'pm', 'PM', 'am', 'AM', 'Am', 'Pm', ... thanks to the case-insensitive modifier above
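For the mixed output mentioned in the question's NOTE, the same idea can be run in a loop with /g. The following is only a sketch under my own assumptions, none of which are confirmed by the question: the chat text never contains a line break, the username is made of word characters, dots and dashes, the trailing "Testing - ..." line of a multi-line record is ignored, and the timestamp check is simplified. The names $record and @records and the hash keys are hypothetical. The chat group is made non-greedy so that one match cannot swallow several records.

```perl
use strict;
use warnings;

# Sample copied from the question: records 1 and 3 are single-line,
# record 2 is spread over several lines.
my $output = <<'__END__';
Hello Line1 FirstName.LastName 10 3/23/2011 2:46 PM

Hello Line2

                         Line2FirstName-LastName       8       7/17/2015 1:15 PM

Testing - 12323232323 Hello There

Hello Line3 Line3FirstName.LastName 8 3/21/2011 2:46 PM
__END__

# Assumed record layout, written with /x so each part can be commented.
my $record = qr{
    ^ \h*              # start of a line, optional indentation
    ( \S [^\n]*? )     # chat text: stays on one line, as short as possible
    \s+                # separator: spaces or line breaks
    ( [\w.-]+ )        # username such as First.Last or First-Last
    \s+ ( \d+ )        # the number
    \s+ ( \d{1,2}/\d{1,2}/\d{4} \s+ \d{1,2}:\d{2} \s* [ap]m )  # timestamp
}imx;

# Collect every record; /g resumes the search after the previous match.
my @records;
while ($output =~ /$record/g) {
    push @records, { chat => $1, username => $2, chars => $3, timestamp => $4 };
}

for my $r (@records) {
    print join(' | ', @{$r}{qw(chat username chars timestamp)}), "\n";
}
```

Keeping the chat text on one line is what stops the "Testing - ..." line from being glued onto the next record. If real chat text can wrap, this assumption has to be revisited.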




Answer 2:

Your regex says:

^\s*(.*)\n*\n*  # line starts with optional space followed by anything 
|      # or
\s+([a-zA-Z0-9._]+)\s+(\d+)\s+(.*)$ # spaces followed by any words followed by spaces, digits, spaces,  anything at the end of the line
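A quick way to see this split in action is to run the question's pattern, unchanged, against the single-line sample and print every capture group. This is just an illustrative sketch; the variable names are mine.

```perl
use strict;
use warnings;

# Single-line sample from the question.
my $line = "Hi There       FirstName.LastName    10  3/23/2011 2:46 PM";

# The pattern from the question, unchanged.
my @fields = $line =~ /^\s*(.*)\n*\n*|\s+([a-zA-Z0-9._]+)\s+(\d+)\s+(.*)$/m;

# The left alternative succeeds at position 0, so only group 1 is filled;
# groups 2 to 4 belong to the right alternative and stay undefined.
for my $i (0 .. $#fields) {
    printf "group %d: %s\n", $i + 1,
        defined $fields[$i] ? "'$fields[$i]'" : 'undef';
}
```

Group 1 ends up holding the whole line, because (.*) is greedy and nothing after it is required.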

Consider this:

/^From|To$/

Alternation has the lowest precedence of any regex operator, so | splits the entire pattern into two alternatives. The above matches either a line that starts with 'From' or a line that ends with 'To'.

Compare to this:

    /^(From|To)$/

The above matches only lines that consist entirely of 'From' or 'To'.
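The difference between the two patterns is easy to check with a few test strings. The samples below are made up for illustration, and the hash names are hypothetical.

```perl
use strict;
use warnings;

# Made-up samples: two exact words and three strings that merely contain them.
my @samples = ('From', 'To', 'From: alice', 'Reply To', 'Tomorrow');

my (%ungrouped, %grouped);
for my $s (@samples) {
    # Without parentheses: "^From" OR "To$".
    $ungrouped{$s} = $s =~ /^From|To$/   ? 'match' : 'no match';
    # With parentheses: the anchors apply to both alternatives.
    $grouped{$s}   = $s =~ /^(From|To)$/ ? 'match' : 'no match';
    printf "%-12s ungrouped: %-9s grouped: %s\n", $s, $ungrouped{$s}, $grouped{$s};
}
```

'From: alice' and 'Reply To' match only the ungrouped pattern, which is the same effect that breaks the regex in the question.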