可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

How should I go about reading a .docx file with Python and being able to recognize the italicized text and storing it as a string?

I looked at the docx python package but all I see is features for writing to a .docx file.

I appreciate the help in advance

回答1:

Here's what my example document, TestDocument.docx, looks like.

Note: The word "Italic" is in Italics, but "Emphasis" uses the style, Emphasis.

If you install the python-docx module. This is a fairly simple exercise.

>>> from docx import Document
>>> document = Document('TestDocument.docx')
>>> for p in document.paragraphs:
...     for run in p.runs:
...             print run.text, run.italic, run.bold
... 
Test Document None None
Italics True None
Emp None None
hasis None None
>>> [[run.text for run in p.runs if run.italic] for p in document.paragraphs]
[[], ['Italics'], []]

The Run.italic attribute captures whether the text is formatted as Italic, but it doesn't know if a text block has a Style that is rendered in Italic, but it can be detected by checking Run.style.name (if you know what styles in your document are rendered in Italics.

>>> [[run.text for run in p.runs if run.style.name=='Emphasis'] for p in document.paragraphs]
[[], [], ['Emp', 'hasis']]

回答2:

Your best bet is going about unzipping the docx which will create a directory called word. Within that directory is document.xml, from there you would need to learn the xml structure and key words to be able to read just an italicized text. once you complete that all you have to do is pull the text string from xml file.