How can I read a .docx file with Python, recognize the italicized text, and store it as a string?
I looked at the docx Python package, but all I see are features for writing to a .docx file.
I appreciate the help in advance.
Here's what my example document, TestDocument.docx, looks like.
Note: The word "Italics" is formatted directly as italic, but "Emphasis" uses the character style Emphasis.
If you install the python-docx module, this is a fairly simple exercise.
>>> from docx import Document
>>> document = Document('TestDocument.docx')
>>> for p in document.paragraphs:
...     for run in p.runs:
...         print(run.text, run.italic, run.bold)
...
Test Document None None
Italics True None
Emp None None
hasis None None
>>> [[run.text for run in p.runs if run.italic] for p in document.paragraphs]
[[], ['Italics'], []]
The Run.italic attribute captures whether the text is directly formatted as italic (it is True, False, or None when the setting is inherited), but it does not tell you whether a run uses a style that is rendered in italic. You can detect that case by checking Run.style.name, provided you know which styles in your document are rendered in italics.
>>> [[run.text for run in p.runs if run.style.name=='Emphasis'] for p in document.paragraphs]
[[], [], ['Emp', 'hasis']]
Your best bet is to unzip the .docx file, which is an ordinary zip archive. Among the extracted contents is a directory called word, and within that directory is document.xml. From there you would need to learn the XML structure and element names well enough to pick out just the italicized text: directly italicized runs carry a w:i element inside their run properties (w:rPr). Once you have that, all you have to do is pull the text string from the XML file.
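As a rough sketch of that approach using only the standard library, the archive can be read in place rather than unzipped to disk. The function name italic_runs_from_xml is hypothetical, and the code only looks at direct formatting: italics that come from a style such as Emphasis are defined in word/styles.xml and are not detected here.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace in the {uri}tag form that ElementTree expects.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def italic_runs_from_xml(path):
    """Return the text of every run whose own properties turn italics on."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    found = []
    for run in root.iter(W + "r"):
        italic = run.find(W + "rPr/" + W + "i")
        if italic is None:
            continue  # no direct italic setting on this run
        # <w:i w:val="false"/> (or "0" / "off") explicitly switches italics off.
        if italic.get(W + "val", "true") in ("false", "0", "off"):
            continue
        text = "".join(t.text or "" for t in run.iter(W + "t"))
        if text:
            found.append(text)
    return found
```

Calling "".join(italic_runs_from_xml('TestDocument.docx')) would then give the italic text as a single string.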