The title explains the problem: I have doc and docx files whose author information I want to retrieve so that I can restructure my files.
os.stat returns only the size and timestamps, that is, information about the file itself rather than about the document.
open(filename, 'rb').read(200) returns many characters that I could not parse.
There is a module called xlrd for reading xlsx files, yet it still doesn't let me read doc or docx files. I am aware that the new Office files are not easily read by non-MS Office programs, so if that's impossible, gathering the information from old Office files would suffice.
Since docx files are just zipped XML, you could unzip the docx file and pull the author information out of one of the XML files inside. I'm not quite sure where it is stored, but a brief look suggests it is kept as dc:creator in docProps/core.xml.
Here's how you can open the docx file and retrieve the creator:
import zipfile
import lxml.etree

# open the docx file as a zip archive
zf = zipfile.ZipFile('my_doc.docx')
# use lxml to parse the XML file we are interested in
doc = lxml.etree.fromstring(zf.read('docProps/core.xml'))
# retrieve the creator (dc is the Dublin Core namespace)
ns = {'dc': 'http://purl.org/dc/elements/1.1/'}
creator = doc.xpath('//dc:creator', namespaces=ns)[0].text
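The same lookup can also be done with the standard library alone, which avoids the lxml dependency. The following is a minimal sketch rather than the answerer's code: the helper name docx_creator is made up for illustration, and it assumes the file is a valid zip archive; it returns None when docProps/core.xml or the dc:creator element is missing.

```python
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core namespace used for dc:creator in docProps/core.xml
DC_NS = {'dc': 'http://purl.org/dc/elements/1.1/'}


def docx_creator(path):
    """Return the dc:creator text of a docx file, or None if it is not recorded.

    `path` may be a file name or a binary file-like object.
    (Hypothetical helper, named here only for illustration.)
    """
    with zipfile.ZipFile(path) as zf:
        try:
            core_xml = zf.read('docProps/core.xml')
        except KeyError:
            # the archive has no core-properties part
            return None
    root = ET.fromstring(core_xml)
    creator = root.find('.//dc:creator', DC_NS)
    return creator.text if creator is not None else None
```

With the file from the snippet above, docx_creator('my_doc.docx') would return the stored creator string.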
You can use COM interop to access the Word object model. This link describes the technique: http://www.blog.pythonlibrary.org/2010/07/16/python-and-microsoft-office-using-pywin32/
The secret when working with any of the Office objects is knowing which item to access among the overwhelming number of methods and properties. In this case each document has a collection called BuiltInDocumentProperties. The property of interest is "Last Author", which names the last person who saved the document; the original creator is stored in the "Author" property.
After you open the document, you can access the author with something like word.ActiveDocument.BuiltInDocumentProperties("Last Author")
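Put together, the COM approach might look like the sketch below. It is an illustration under assumptions, not code taken from the answer or the linked post: it needs Windows, an installed copy of Word and the pywin32 package; the function names read_builtin_property and word_last_author are hypothetical; and it opens the document through Documents.Open instead of relying on ActiveDocument. The Word application object is passed in as a parameter so that the COM-specific import stays in one place.

```python
def read_builtin_property(word_app, path, name="Last Author"):
    """Open `path` in the given Word application object and return one
    built-in document property (for example "Author" or "Last Author")."""
    doc = word_app.Documents.Open(path)
    try:
        # BuiltInDocumentProperties(name) yields a property object;
        # its Value attribute holds the actual string
        return doc.BuiltInDocumentProperties(name).Value
    finally:
        # close without saving so the file is left untouched
        doc.Close(False)


def word_last_author(path):
    """Start Word through COM (Windows + pywin32 only) and read "Last Author"."""
    import win32com.client  # assumption: pywin32 is installed

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        return read_builtin_property(word, path)
    finally:
        word.Quit()
```

Word generally expects an absolute path here, so a call would look like word_last_author(r'C:\docs\my_doc.docx'), where the path is only a placeholder.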
For old Office documents you could use hachoir-metadata.
I use it daily in a script and it works flawlessly, but I don't know whether it works with the new file formats.
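Because hachoir-metadata is only vouched for on the old formats here, it can help to check which kind of file you have before picking a tool. The sketch below is an assumption-laden illustration, not part of hachoir: office_container is a hypothetical helper that looks at the first bytes of the file. Old binary Office files are OLE2 compound files, which start with a fixed eight-byte signature, while docx and the other new formats are zip archives.

```python
import zipfile

# Signature at the start of OLE2 compound files (old .doc, .xls, .ppt)
OLE2_MAGIC = bytes.fromhex('d0cf11e0a1b11ae1')


def office_container(path):
    """Return 'ole2' for old binary Office files, 'zip' for zip-based
    files such as docx, and None for anything else.

    (Hypothetical helper; it identifies the container only and does not
    prove that the file is a Word document.)
    """
    with open(path, 'rb') as fh:
        header = fh.read(len(OLE2_MAGIC))
    if header == OLE2_MAGIC:
        return 'ole2'
    if zipfile.is_zipfile(path):
        return 'zip'
    return None
```

Files reported as 'ole2' are candidates for hachoir-metadata; files reported as 'zip' can be handled with the zip-based approach.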
How about using the python-docx library? It lets you pull more information about the file than just the author.
# pip install python-docx
# (or: pip3 install python-docx)
import docx

file_name = 'file_path_name.docx'
document = docx.Document(file_name)
# core_properties exposes the metadata stored with the document
core_properties = document.core_properties
print(core_properties.author)
print(core_properties.created)
print(core_properties.last_modified_by)
print(core_properties.last_printed)
print(core_properties.modified)
print(core_properties.revision)
print(core_properties.title)
print(core_properties.category)
print(core_properties.comments)
print(core_properties.identifier)
print(core_properties.keywords)
print(core_properties.language)
print(core_properties.subject)
print(core_properties.version)
print(core_properties.content_status)
You can find more information in the python-docx documentation, and the source code is on the project's GitHub page.