I am trying to parse a text format with a regex in Perl. Sometimes, instead of a single line, the same format is spread over 3 lines.
For the single-line format below I can use
/^\s*(.*)\s+([a-zA-Z0-9._]+)\s+(\d+)\s+(.*)/
to get the first 3 individual items in the line:
Hi There FirstName.LastName 10 3/23/2011 2:46 PM
Below is the multi-line format I see. I am trying to use something like
/^\s*(.*)\n*\n*|\s+([a-zA-Z0-9._]+)\s+(\d+)\s+(.*)$/m
to get the individual items, but it doesn't seem to work.
Hi There
FirstName-LastName 8 7/17/2015 1:15 PM
Testing - 12323232323 Hello There
Any suggestions? Is multi-line regex possible?
NOTE: In the same output I can see single-line records, multi-line records, or both, so the output can look like this:
Hello Line1 FirstName.LastName 10 3/23/2011 2:46 PM
Hello Line2
Line2FirstName-LastName 8 7/17/2015 1:15 PM
Testing - 12323232323 Hello There
Hello Line3 Line3FirstName.LastName 8 3/21/2011 2:46 PM
You can certainly apply a regex across multiple lines.
I've used the negated word class \W+ between words to match the spaces and newlines between them (\W is equivalent to [^a-zA-Z0-9_]). The chat is viewed as a repeated \w+\W+ block. If you provide more specific input/output cases, I can refine the example code:
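A minimal sketch of such a pattern in use, assembled from the pieces described in the legend. It is not necessarily the exact code this answer was written around: the trailing flags, the lazy +? on the chat group, the helper name parse_records and the sample strings are assumptions made for illustration.

```perl
use strict;
use warnings;

# Pattern assembled from the legend. Assumptions: the flags sit at the end
# (imx) instead of inline, and the chat group is lazy (+?) so that several
# records in one string are matched one at a time.
my $record_re = qr{
    ^ \s*
    ( (?: \w+ \W+ )+? )       # chat: words and the separators between them
    ( \w+ [-,.] \w+ )         # username: two words joined by - , or .
    \s+ ( \d+ )               # chars: one or more digits
    \s+
    ( [0-1]?\d / [0-3]?\d / [1-2]\d{3}    # date
      \s+ [0-2]?\d : [0-5]?\d \s? [ap]m ) # time
}imx;

# Hypothetical helper: returns one hash reference per record found in $text.
sub parse_records {
    my ($text) = @_;
    my @records;
    while ($text =~ /$record_re/g) {
        my ($chat, $username, $chars, $timestamp) = ($1, $2, $3, $4);
        $chat =~ s/\s+\z//;    # drop the whitespace or newline left at the end
        push @records, {
            chat  => $chat,  username  => $username,
            chars => $chars, timestamp => $timestamp,
        };
    }
    return @records;
}

my $single = "Hi There FirstName.LastName 10 3/23/2011 2:46 PM\n";
my $multi  = "Hi There\nFirstName-LastName 8 7/17/2015 1:15 PM\n"
           . "Testing - 12323232323 Hello There\n";

for my $r (parse_records($single), parse_records($multi)) {
    print join(' | ', @{$r}{qw(chat username chars timestamp)}), "\n";
}
# Hi There | FirstName.LastName | 10 | 3/23/2011 2:46 PM
# Hi There | FirstName-LastName | 8 | 7/17/2015 1:15 PM
```

Note that this pattern does not recognize the third line of a multi-line record ("Testing - ..."). On its own that line is simply skipped, but in mixed input it ends up at the front of the next record's chat, so it would need separate handling once its meaning is known.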
Legend
m/^.../ : match regex (not the substitute type), anchored at the start of the line
(?im) : case-insensitive search and multiline mode (^ and $ also match at the start and end of each line)
\s* : matches zero or more whitespace characters (spaces, tabs, line breaks or form feeds)
((?:\w+\W+)+) : (match group $chat) matches one or more repetitions of a pattern composed of a single word \w+ (letters, numbers, '_') followed by non-word characters \W+ (everything that is not \w, including the newline \n). This is later filtered to remove trailing whitespace
(\w+[-,\.]\w+) : (match group $username) this is our weak point. If the username is not composed of two regex words separated by a dash '-', a comma ',' or (UPDATE) a dot '.', the entire regex cannot work properly (I extracted these possibilities from your question; the format is not directly specified)
(\d+) : (match group $chars) a number composed of one or more digits
([0-1]?\d\/[0-3]?\d\/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m) : (match group $timestamp) this is longer than the others, so let's split it up:
  [0-1]?\d\/[0-3]?\d\/[1-2]\d{3} : matches a date composed of a month (with an optional leading zero), a day (with an optional leading zero) and a year from 1000 to 2999 (a relaxed constraint :)
  [0-2]?\d:[0-5]?\d\s?[ap]m : matches the time: hour:minutes, an optional space and 'pm, PM, am, AM, Am, Pm...' thanks to the case-insensitive modifier above
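To see how relaxed the timestamp piece really is, here is a small sketch that tries it on a few strings. The \A and \z anchors, the helper name looks_like_timestamp and the sample values are assumptions added for the demonstration; the pattern body is the one from the legend, with the optional space before am/pm.

```perl
use strict;
use warnings;

# The timestamp piece from the legend, anchored so the whole string must match.
my $timestamp_re =
    qr{\A[0-1]?\d/[0-3]?\d/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m\z}i;

# Hypothetical helper for this sketch: 1 if the string fits the pattern, else 0.
sub looks_like_timestamp {
    my ($s) = @_;
    return $s =~ $timestamp_re ? 1 : 0;
}

for my $s ('3/23/2011 2:46 PM', '07/17/2015 1:15pm',
           '19/39/2999 29:59 am', '3/23/3011 2:46 PM') {
    printf "%-22s %s\n", $s, looks_like_timestamp($s) ? 'match' : 'no match';
}
# 3/23/2011 2:46 PM      match
# 07/17/2015 1:15pm      match
# 19/39/2999 29:59 am    match
# 3/23/3011 2:46 PM      no match
```

The third line shows the cost of the relaxed constraint: month 19, day 39 and hour 29 are all accepted. If that matters, validating the captured values after the match is probably simpler than tightening the regex.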
Your regex says:
Consider this:
Alternation sticks closely to the sequences around it. The above is really saying: find a line starting with 'Fro', followed by 'm' or 'T', followed by 'o', followed by the end of the line.
Compare to this:
The above will find lines that contain only 'From' or 'To'.
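As a hedged illustration of how grouping changes what an alternation covers in Perl, the sketch below compares three stand-in patterns. They are assumptions chosen to mirror the 'From'/'To' wording here, not patterns taken from this answer, and the helper hit is hypothetical.

```perl
use strict;
use warnings;

# Stand-in patterns (assumptions made for illustration).
my %re = (
    ungrouped => qr/^From|To$/,        # "^From" OR "To$": | splits the whole pattern
    grouped   => qr/^(?:From|To)$/,    # the whole line is From or To
    inner     => qr/^Fro(?:m|T)o$/,    # Fro, then m or T, then o, then end of line
);

# Hypothetical helper: 1 if the named pattern matches the string, else 0.
sub hit {
    my ($name, $s) = @_;
    return $s =~ $re{$name} ? 1 : 0;
}

print hit('ungrouped', 'From: me'), "\n";   # 1 (starts with From)
print hit('ungrouped', 'Reply To'), "\n";   # 1 (ends with To)
print hit('grouped',   'From: me'), "\n";   # 0
print hit('grouped',   'To'),       "\n";   # 1
print hit('inner',     'FroTo'),    "\n";   # 1
print hit('inner',     'From'),     "\n";   # 0
```

Without parentheses, | separates everything to its left from everything to its right. The same applies to the pattern in the question: its | divides it into ^\s*(.*)\n*\n* on one side and \s+([a-zA-Z0-9._]+)\s+(\d+)\s+(.*)$ on the other, so the two halves are tried as separate alternatives rather than matched one after the other.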