I'm trying to parse a WhatsApp chat log using regex. I have a solution that works for most cases, but I'd like to improve it and don't know how, since I am quite new to regex.
The chat.txt file looks like this:
[06.12.16, 16:46:19] Person One: Wow thats amazing
[06.12.16, 16:47:13] Person Two: Good morning and this goes over multiple
lines as it is a very long message
[06.12.16, 16:47:22] Person Two: ::
My solution so far parses most of these messages correctly. However, I have a few hundred cases where the message starts with a colon, like the last example above. This leads to an unwanted value of Person Two: : as the sender.
Here is the regex I am working with so far:
pattern = re.compile(r'\[(?P<date>\d{2}\.\d{2}\.\d{2}),\s(?P<time>\d{2}:\d{2}:\d{2})]\s(?P<sender>(?<=\s).*(?::\s*\w+)*(?=:)):\s(?P<message>(?:.+|\n+(?!\[\d{2}\.\d{2}\.\d{2}))+)')
Any advice on how I could work around this bug would be appreciated!
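One possible direction, offered as a minimal sketch rather than a definitive fix: stop the sender group at the first colon instead of letting a greedy `.*` run past it. This assumes that sender names never contain a colon, which may not hold for every chat export. The names `PATTERN` and `parse_chat` are hypothetical and used only for illustration.

```python
import re

# Assumption: a sender name never contains a colon, so the sender group
# can simply stop at the first ':' after the timestamp.
PATTERN = re.compile(
    r'^\[(?P<date>\d{2}\.\d{2}\.\d{2}),\s'
    r'(?P<time>\d{2}:\d{2}:\d{2})\]\s'
    r'(?P<sender>[^:\n]+):\s'
    # First line of the message, then any following lines that do not
    # begin with a new "[dd.mm.yy, " timestamp.
    r'(?P<message>.*(?:\n(?!\[\d{2}\.\d{2}\.\d{2},\s).*)*)',
    re.MULTILINE,
)


def parse_chat(text):
    """Return one dict per message with date, time, sender and message."""
    records = []
    for match in PATTERN.finditer(text):
        record = match.groupdict()
        # The continuation-line group can pick up a trailing newline
        # at the end of the file, so strip it here.
        record["message"] = record["message"].rstrip("\n")
        records.append(record)
    return records


sample = (
    "[06.12.16, 16:46:19] Person One: Wow thats amazing\n"
    "[06.12.16, 16:47:13] Person Two: Good morning and this goes over multiple\n"
    "lines as it is a very long message\n"
    "[06.12.16, 16:47:22] Person Two: ::\n"
)

records = parse_chat(sample)
for record in records:
    print(record["sender"], "|", repr(record["message"]))
```

With this pattern the last sample line should yield Person Two as the sender and :: as the message. If some contacts do have a colon in their name, the character class would need to be relaxed, for example by matching lazily up to the first colon that is followed by a space.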