I'm trying to extract email addresses from plain text transcripts of emails. I've cobbled together a bit of code to find the addresses themselves, but I don't know how to make it discriminate between them; right now it just spits out all email addresses in the file. I'd like to make it so it only spits out addresses that are preceeded by "From:" and a few wildcard characters, and ending with ">" (because the emails are set up as From [name]<[email]>).
Here's the code now:
import re #allows program to use regular expressions
foundemail = []
#this is an empty list
mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
#do not currently know exact meaning of this expression but assuming
#it means something like "[stuff]@[stuff][stuff1-4 letters]"
# "line" is a variable is set to a single line read from the file
# ("text.txt"):
for line in open("text.txt"):
foundemail.extend(mailsrch.findall(line))
# this extends the previously named list via the "mailsrch" variable
#which was named before
print foundemail
Try this out:
Unfortunately it only finds the first email in each line because it's expecting header lines, but maybe that's ok?
Roughly speaking, you can:
This utilizes the built-in python parseaddr function to parse the address out of the from line (as demonstrated by other answers), without the overhead necessarily of parsing the entire message (e.g. by using the more full featured email and mailbox packages). The script here simply skips any lines that do not begin with "From:". Whether the overhead matters to you depends on how big your input is and how often you will be doing this operation.