Question:
I want to remove all the From, To, Cc, Subject and Sent header lines from this text document and keep only the body of each mail, so that I can summarize the content of the document. What is the best way to do this in Python? I think it is better to do the extraction first and the preprocessing afterwards. The payload and is_multipart part of my code does not work properly; that is where my doubt is, so I have marked that part with comments and need help there.
Attaching code and the .txt file below for reference.
import os, sys, csv
import glob
import re
import email
#from tika import parser
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim.summarization import summarize, keywords
# Set path to directory where files are
dirs = 'C:\\Users\\Lenovo\\.spyder-py3\\Testing\\'
#os.chdir(dirs)
for filename in glob.glob(os.path.join(dirs, '*.txt')):
    try:
        for files in filename:
            file = open(filename, 'r', encoding ='utf-8')
            filecontents = file.read()
            filecontents = re.sub(r'\s+', ' ', filecontents)
            print(filecontents)
            filecontents = filecontents.strip('\n')
            b = email.message_from_string(filecontents)# NEED
            if b.is_multipart():#HELP
                for payload in b.get_payload():#HERE
                    # if payload.is_multipart(): ...#SO
                    print (payload.get_payload())#COMMENTED
            else:#
                print (b.get_payload())#
            summary = summarize(filecontents, ratio =0.10)
            print(summary)
            kw = keywords(filecontents, words=15)
            print(kw)
            break
            #writer.writerow([file, summary, kw])
    except Exception as e:
        pass
TEXT FILE
Stephanie /ANN
From: Mr.A, <.Mr.A@abc.com>
Sent: Wednesday, July 25, 2018 2:27 PM
To: , Tim /ANN; Abd, May /ANN
Cc: Mr.A, ; Theoder Jerry,
Subject: [EXTERNAL] RE: Holdings: XXXX SPA – mfno.1322
Dear Dr. Tim A. ,
The option-2 is fine. By the way, we had received in the past Letter of Authorization for many companies other
than Spa and I guess Xxxx does not do bANNiness with them either. If yes, then need to submit withdrawal
of Letter of Authorization for those companies and send a Letter of Authorization for spa. stating for any
applications submitted. We will send an administrative filing issue letter for both the holder and the agent.
Thank you!
Regards,
Mr.A
PRODUCT Master File
CDER
Currently, there is no requirement to submit or resubmit NAs in any electronic format. However, starting May 5, 2018,
new NAs, as well as any submissions to the existing NAs mANNt be submitted electronically in legal (electronic Common
Technical Document) format specified by GROUP A in the legal guidance. NA submissions that are not submitted in legal
format after this date may be subject to rejection. For more information please check the NA website
www.GROUP A.gov/abc/bca
This communication is an informal communication consistent with which represents my best judgment
at this time, but does not constitute an advisory opinion, does not necessarily represent the formal position of the
GROUP A, and does not bind or otherwise obligate or commit the agency to the views expressed. This communication,
including any attachments, is intended only for the person or entity to which it is addressed and may contain
confidential material. Any review, retransmission, distribution or other ANNe of this information by persons or entities
other than the intended recipient is prohibited. If you received this in error, please destroy any copies, contact the
sender and delete the material from any computer. Thank you.
From: Tim.@xxxx.com [mailto:Tim.@xxxx.com]
Sent: Wednesday, July 25, 2018 2:10 PM
To: Mr.A, <.Mr.A@abc.com>
Cc: May.Abd@xxxx.com
Subject: RE: Holdings: XXXX SPA ‐ dm 013383
Dear ,
XXXX
2
Thanks for your phone call to clarify your needs and to understand the situation. I have confirmed that Xxxx only does
direct bANNiness for test S intermediate with b. and not with the other companies (e,
x, etc.) that are secondary companies. Based on our discANNsion, I believe that we do not need to
provide QAs for these secondary companies or mention them in our NA file as they would be covered under a
separate QA S.p.A. to them. If this is correct, then I believe you mentioned that we have two options as
described below:
Option 1: We can issue a separate QA for each . NA to be specific on which NA is being cross‐referenced
to our NA 13383.
Option 2: We can do a single QA for and mention that they can cross‐reference any of their NAs. This
would allow them to cross‐reference any of their
If I have misunderstood or am incorrect in my response and we need to discANNs further, please let me know.
If not, when you issue your request, can you please send to me and May Abd by email?
Kind regards.
Tim
Tim A. , BsC
Director, YY SERVICES)
Xxxx ANN
Phone/FAX: 2312333
Cell: 23312123131
Email: tim.@xxxx.com
From: , Tim /ANN
Sent: Monday, July 23, 2018 7:05 AM
To: 'Mr.A, '
Cc: Abd, May /ANN
Subject: RE: [EXTERNAL] Holder: XXXX SPA - NA 013383
Dear ,
May is now on vacation and I am covering for her during her absence. Is there a good time to call you today or later this
week? Please let me know and we can schedule or please call my cell phone 21313131231 at your convenience.
Kind regards.
Tim
Tim A. , MSC
Director, PQR
Xxxx
Phone/FAX: 2312313313
Cell: 3142342424
Email: tim.@xxxx.com
XXXX
3
‐‐‐‐‐‐‐‐‐‐ Forwarded message ‐‐‐‐‐‐‐‐‐‐
From: "Mr.A, " <.Mr.A@abc.com>
Date: Jul 20, 2018 9:01 AM
Subject: [EXTERNAL] Holder: XXXX SPA ‐ NA 013383
To: "TRETE/ANN" <May.Abd@xxxx.com>
Cc: "mno.com>
Dear May Abd,
. I need to talk to you on this.
Thank you!
Regards,
Mr.A
PRODUCT Master File
CDER
Currently, there is no requirement to submit or resubmit NAs in any electronic format.
format after this date may be subject to rejection. For more information please check the NA website
www.GROUP A./cder/NA
This communication is an informal communication which represents my best judgment
at this time, but does not constitute an advisory opinion, does not necessarily represent the formal position of the
GROUP A, and does not bind or otherwise obligate or commit the agency to the views expressed. This communication,
including any attachments, is intended only for the person or entity to which it is addressed and may contain
confidential material. Any review, retransmission, distribution or other ANNe of this information by persons or entities
other than the intended recipient is prohibited. If you received this in error, please destroy any copies, contact the
sender and delete the material from any computer. Thank you.
XXXX
Answer 1:
It's not really clear which part of the code you need help with, what you want it to do instead of what it currently does, or how to pass on the results for further processing correctly.
However, I will note that your code has a number of problems.
- You cannot read an email message as UTF-8 text. Regardless of the file extension, an RFC822 message is simply a sequence of bytes. Traditional email could come in a large number of different encodings, and if you try to coerce it into UTF-8, you will run into UnicodeDecodeErrors and other snags.
- As always, a blanket except Exception: is a major bug. Perhaps you only put this in for debugging, but it actually makes debugging harder, because combined with pass it silently swallows every error.
- Typical modern email messages come with somewhat complex MIME body structures which you have to analyze in context before you decide which one(s) you actually want to process. One common phenomenon is multipart/alternative, where the same message is rendered in different formats so that recipients can decide whether they want to read it rendered as HTML, plain text, or, occasionally, perhaps PDF or RTF or a single image or whatever, depending on the application. Also, HTML structures often have multiple parts, because the main HTML wants to pull in small images which are supplied in the MIME structure as well (company logo, animated emojis, and other insults to the reader). Perhaps see also What are the "parts" in a multipart email?
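To make the multipart/alternative point concrete, here is a minimal sketch that builds a small two-format message in memory and lists the content type of every part. The helper name describe_parts and the sample bodies are made up for illustration; only the standard library email package is used.

```python
from email.message import EmailMessage


def describe_parts(msg):
    """Return the content type of every part, in walk() order."""
    return [part.get_content_type() for part in msg.walk()]


# Build a message that carries the same text as plain text and as HTML.
msg = EmailMessage()
msg['Subject'] = 'demo'
msg.set_content('Plain text body')
msg.add_alternative('<p>HTML body</p>', subtype='html')

print(describe_parts(msg))
```

The container part comes first, followed by its children, so you can inspect this list before deciding which part to feed to the summarizer.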
Another complication for this answer is that Python's email library went through an overhaul relatively recently. The new functionality was introduced provisionally in Python 3.3, but only became the stable, primarily documented API in 3.6. Most of the code you will find out in the wild will be using the pre-3.6 facilities, but going forward, you will probably want to target the new and improved API.
With the legacy API your code might look something like
import email

for filename in glob.glob(os.path.join(dirs, '*.txt')):
    # Not useful; we already have a filename
    #for files in filename:
    # Open in binary mode, don't try to guess encoding
    # Use a context manager so we don't leave the file open
    with open(filename, 'rb') as file:
        # Just let the email library take it from here
        #filecontents = file.read()
        #filecontents = re.sub(r'\s+', ' ', filecontents)
        #print(filecontents)
        #filecontents = filecontents.strip('\n')
        b = email.message_from_binary_file(file)
    if b.is_multipart():
        # There are a number of things you could do to pick out
        # one or more payloads for analysis, but let's just take
        # the first text/plain part and call it "main_part"
        for part in b.walk():
            if part.get_content_type() == 'text/plain':
                main_part = part.get_payload()
                break
    else:
        main_part = b.get_payload()
    summary = summarize(main_part, ratio =0.10)
    print(summary)
    kw = keywords(main_part, words=15)
    print(kw)
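One detail the legacy snippet glosses over is that get_payload() without arguments does not undo the transfer encoding or apply the declared charset. The following is a rough sketch of how that could be handled with the legacy API; the function name first_plain_text, the fallback charset and the sample message are my own assumptions, not something from the question.

```python
import email


def first_plain_text(msg):
    """Return the decoded text of the first text/plain part, or None."""
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            # decode=True undoes base64/quoted-printable and returns bytes
            raw = part.get_payload(decode=True)
            # Assumption: fall back to UTF-8 when no charset is declared
            charset = part.get_content_charset() or 'utf-8'
            return raw.decode(charset, errors='replace')
    return None


# A tiny hand-written message whose body is Latin-1, not UTF-8.
raw_message = (
    b"From: sender@example.com\r\n"
    b"Subject: demo\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xe9 au lait\r\n"
)

msg = email.message_from_bytes(raw_message)
text = first_plain_text(msg)
print(text)
```

Reading the same bytes with open(..., encoding='utf-8') would fail on the \xe9 byte, which is why the file is opened in binary mode and the decoding is left to the charset declared in the message.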
To use the new 3.6+ API you will need to adapt this to something like
from email.policy import default as default_email_policy
...
b = email.message_from_binary_file(file, policy=default_email_policy)
main_part = b.get_body(['related', 'plain', 'html'])
This will result in a new email.message.EmailMessage object which has some different methods and different behaviors than the legacy email.message.Message class. The documentation suggests that maybe one day the default policy will be passed in by default, at which point old code will switch to new behavior (but also probably some amount of unpleasant surprises and outright breakage).
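Notice also the get_body() method which is new in 3.6 and which lets you easily pick out a "probable main part"; though if no text/plain part is available, the code above will fall back to HTML, which you will then need to process further to extract the actual text (look at Beautifulsoup maybe?). Note that get_body() returns a message part (or None) rather than a string, so you need to call its get_content() method before passing the text on.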
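As a minimal sketch of the new API, assuming you only care about a plain-text body with HTML as a fallback, something like the following should work. The helper name extract_body_text and the demo message are hypothetical, and I deliberately left 'related' out of the preference list so that the returned part is always a text part.

```python
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import default as default_email_policy


def extract_body_text(raw_bytes):
    """Parse raw message bytes and return the text of the most likely body part."""
    msg = message_from_bytes(raw_bytes, policy=default_email_policy)
    # Prefer text/plain, fall back to text/html
    body = msg.get_body(preferencelist=('plain', 'html'))
    if body is None:
        return None
    # get_content() decodes the transfer encoding and charset for us
    return body.get_content()


# Build a demo multipart/alternative message and serialize it to bytes.
demo = EmailMessage()
demo['From'] = 'sender@example.com'
demo['Subject'] = 'demo'
demo.set_content('The option-2 is fine.')
demo.add_alternative('<p>The option-2 is fine.</p>', subtype='html')

text = extract_body_text(demo.as_bytes())
print(text)
```

If the function returns HTML because no plain-text part exists, the markup still has to be stripped before summarizing.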
There is no technical, robust, reliable way to separate boilerplate (headers, signatures, etc) from actual content in email. Some HTML email clients might provide hints in the generated message as to which <div> contains things the user typed in, but in the general case, you just have to wade up to your eyebrows in (frankly, hopeless) heuristics.
Answer 2:
If you want to only remove the From, Sent, To, Cc, Subject and Forwarded tags from the email you could use regex.
import re
# Compile the pattern once, outside the loop
email_component = re.compile(r'((From:|Sent:|To:|Cc:|Subject:|Forwarded message).*)', re.IGNORECASE)
with open('email_input.txt', 'r') as input:
    lines = input.readlines()
    no_new_lines = [i.strip() for i in lines]
    for line in no_new_lines:
        remove_component = re.findall(email_component, line)
        if remove_component:
            # Print the header lines that would be removed
            print(line)
# output
‐‐‐‐‐‐‐‐‐‐ Forwarded message ‐‐‐‐‐‐‐‐‐‐
From: Mr.A, <.Mr.A@abc.com>
Sent: Wednesday, July 25, 2018 2:27 PM
To: , Tim /ANN; Abd, May /ANN
Cc: Mr.A, ; Theoder Jerry,
Subject: [EXTERNAL] RE: Holdings: XXXX SPA – mfno.1322
Concerning removing the content after 'Regards': I didn't add that to my regex, because emails can be signed in several ways. Here are some of the most common ones:
Best,
Best regards,
Best wishes,
Fond regards,
Kind regards,
Regards,
Sincerely,
Sincerely yours,
Thank you,
With appreciation,
With gratitude,
Yours sincerely,
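If you do want to experiment with cutting a message at its sign-off, a rough sketch could look like the one below. It is only a heuristic: the names SIGN_OFFS, SIGN_OFF_LINE and cut_at_sign_off are made up for illustration, the pattern only matches lines that consist of nothing but a sign-off, and any closing phrase that is not in the list will slip through.

```python
import re

# The sign-offs listed above, without trailing punctuation
SIGN_OFFS = [
    'Best', 'Best regards', 'Best wishes', 'Fond regards', 'Kind regards',
    'Regards', 'Sincerely', 'Sincerely yours', 'Thank you',
    'With appreciation', 'With gratitude', 'Yours sincerely',
]

# A line counts as a sign-off only if it contains the phrase and nothing
# else, optionally followed by a comma, period or exclamation mark.
SIGN_OFF_LINE = re.compile(
    r'^\s*(?:' + '|'.join(re.escape(s) for s in SIGN_OFFS) + r')\s*[,.!]?\s*$',
    re.IGNORECASE,
)


def cut_at_sign_off(lines):
    """Return the lines that come before the first sign-off line."""
    kept = []
    for line in lines:
        if SIGN_OFF_LINE.match(line):
            break
        kept.append(line)
    return kept


sample = [
    'The option-2 is fine.',
    'Best results were seen in May.',
    'Kind regards.',
    'Tim',
    'Director, PQR',
]
print(cut_at_sign_off(sample))
```

Anchoring the pattern to the whole line keeps ordinary sentences that merely start with "Best" or "Thank you" from being treated as a sign-off.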
UPDATED ANSWER ONE
The updated answer below cleans some more of your email input, but more cleaning is required.
import re
with open('email_input.txt', 'r') as input:
    lines = input.readlines()
# Remove some of the extra lines
no_new_lines = [i.strip() for i in lines]
# regex to catch header lines
email_component = re.compile(r'((From:|Sent:|To:|Cc:|Subject:|Date:|Forwarded message).*)', re.IGNORECASE)
remove_headers = [x for x in no_new_lines if not email_component.findall(x)]
# regex to catch greeting lines
greeting_component = re.compile(r'(Dear.*)', re.IGNORECASE)
remove_greeting = [x for x in remove_headers if not greeting_component.findall(x)]
# regex to catch lines with contact details
contact_component = re.compile(r'(Phone.*:)|(Cell:.*)|(Email:.*)', re.IGNORECASE)
remove_contacts = [x for x in remove_greeting if not contact_component.findall(x)]
# regex to catch lines with salutation
email_salutation_component = re.compile(r'Best,(.*?)|Best regards,(.*?)|Best wishes,(.*?)|Fond regards,(.*?)|'
r'Kind regards(.*?)|Regards,(.*?)|Sincerely,(.*?)|Sincerely yours,(.*?)|'
r'Thank you,(.*?)|With appreciation,(.*?)|Yours sincerely,(.*?)', re.IGNORECASE)
remove_salutations = [x for x in remove_contacts if not email_salutation_component.findall(x)]
# do something else
UPDATED ANSWER TWO
The updated answer below uses the Python email library. My input file was an original email message pulled from my email client. Using the code below, I was able to extract the body of every email message that I tried. I also tested the gensim module and it worked correctly.
import email
from gensim.summarization import summarize, keywords
with open('email_input.txt', 'r') as input:
    email_body = ''
    raw_message = input.read()
    # Return a message object structure from a string
    msg = email.message_from_string(raw_message)
    # iterate over all the parts and subparts of a message object tree
    for part in msg.walk():
        # Return the message’s content type.
        if part.get_content_type() == 'text/plain':
            email_body = part.get_payload()
    summary = summarize(email_body, ratio=0.10)
    print(summary)
    kw = keywords(email_body, words=15)
    print(kw)
FINAL ANSWER
This is my final answer to this question. Hopefully, one of these 4 answers meets your requirements.
You will have to do some small cleanup of the output, because I don't know all your requirements.
import re

# Compile once: matches greeting lines such as "Dear ..."
greeting_component = re.compile(r'(Dear.*)', re.IGNORECASE)

with open('email_input.txt') as infile:
    # Boolean state variable to keep track of whether we want to be printing lines or not
    lines_to_keep = False
    for line in infile:
        # Look for lines that start with a greeting
        if line.startswith("Dear"):
            # set lines_to_keep true and start capturing lines
            lines_to_keep = True
        # Look for lines that start with a salutation
        elif line.startswith("Regards") or line.startswith("Kind regards"):
            # set lines_to_keep false and stop capturing lines
            lines_to_keep = False
        if lines_to_keep:
            # Skip the greeting line itself
            if not greeting_component.match(line):
                print(line.rstrip('\n'))
# output
The option-2 is fine. By the way, we had received in the past Letter of Authorization for many companies other than Spa and I guess Xxxx does not do bANNiness with them either. If yes, then need to submit withdrawal of Letter of Authorization for those companies and send a Letter of Authorization for spa. stating for any applications submitted. We will send an administrative filing issue letter for both the holder and the agent.
more here....
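If you would rather collect the bodies than print them, the same greeting/sign-off state machine can be wrapped in a function. This is only a sketch under the same assumptions as the code above (every body starts after a line beginning with "Dear" and ends at a line beginning with "Regards" or "Kind regards"); the function name extract_bodies and its parameters are hypothetical.

```python
def extract_bodies(lines, start='Dear', stops=('Regards', 'Kind regards')):
    """Return one joined string per message body found between greeting and sign-off."""
    bodies = []
    current = None  # None means we are currently outside a body
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(start):
            # A new greeting: flush any unterminated body, then start fresh
            if current:
                bodies.append(' '.join(current))
            current = []
        elif current is not None and stripped.startswith(stops):
            # Sign-off reached: store the body and stop capturing
            bodies.append(' '.join(current))
            current = None
        elif current is not None and stripped:
            current.append(stripped)
    if current:
        # The input ended before a sign-off was seen
        bodies.append(' '.join(current))
    return bodies


sample = [
    'From: Mr.A, <.Mr.A@abc.com>',
    'Dear Dr. Tim A. ,',
    'The option-2 is fine.',
    'Thank you!',
    'Regards,',
    'Mr.A',
    'From: Tim.@xxxx.com',
    'Dear ,',
    'Thanks for your phone call.',
    'Kind regards.',
    'Tim',
]
for body in extract_bodies(sample):
    print(body)
```

Each returned string can then be passed to summarize() or keywords() separately, which avoids mixing the disclaimers and signatures of several mails into one summary.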