I have a mixed filetype collection of MS Word documents. Some files are *.doc and some are *.docx. I'm learning to use tm
and I've (more or less*) successfully created a corpus composed of the *.doc files using this:
ex_eng <- Corpus(DirSource('~/R/expertise/corpus/english'),
                 readerControl = list(reader = readDOC,
                                      language = 'en_CA',
                                      load = TRUE));
This command does not handle *.docx files. I assume that I need a different reader. From this article, I understand that I could write my own (given a good understanding of the .docx format, which I do not currently have).
The readDOC reader uses antiword to parse *.doc files. Is there a similar application that will parse *.docx files?
Or better still, is there already a standard way of creating a corpus of *.docx files using tm?
* more or less, because although the files go in and are readable, I get this warning for every document:
In readLines(y, encoding = x$Encoding) : incomplete final line found on 'path/to/a/file.doc'
.docx files are zipped XML files. If you execute this:
> uzfil <- unzip(file.choose())
and then pick a .docx file in your directory, you get:
> str(uzfil)
chr [1:13] "./[Content_Types].xml" "./_rels/.rels" "./word/_rels/document.xml.rels" ...
> uzfil
[1] "./[Content_Types].xml" "./_rels/.rels" "./word/_rels/document.xml.rels"
[4] "./word/document.xml" "./word/theme/theme1.xml" "./docProps/thumbnail.jpeg"
[7] "./word/settings.xml" "./word/webSettings.xml" "./word/styles.xml"
[10] "./docProps/core.xml" "./word/numbering.xml" "./word/fontTable.xml"
[13] "./docProps/app.xml"
This will also silently unpack all of those files to your working directory. The "./word/document.xml" file has the words you are looking for, so you can probably read them with one of the XML tools in package XML. I'm guessing you would do something along the lines of:
library(XML)
# unzip() above already extracted the files to disk, so uzfil[4] is a real path
xtext <- xmlTreeParse(uzfil[4], useInternalNodes = TRUE)
Actually, you will probably want to unzip to a temp directory instead and add that path to the file name, "./word/document.xml".
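Once the parse succeeds, a minimal sketch of pulling the visible text out (assuming the standard WordprocessingML namespace; in a .docx the text runs live in w:t elements):

ns <- c(w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main")
# collect the text of every w:t node and glue the runs back together
words <- xpathSApply(xtext, "//w:t", xmlValue, namespaces = ns)
full_text <- paste(words, collapse = " ")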
You may want to use the further steps provided by @GaborGrothendieck in this answer: How to extract xml data from a CrossRef using R?
I ended up using docx2txt to convert the .docx files to text. Then I created a corpus from them like this:
ex_eng <- Corpus(DirSource('~/R/expertise/corpus/english'),
                 readerControl = list(reader = readPlain,
                                      language = 'en_CA',
                                      load = TRUE));
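For the conversion step itself, a rough sketch of scripting it from R (assuming the docx2txt script is on your PATH; the executable name varies by install, and I'm assuming docx2txt.pl here, which writes file.txt next to file.docx by default):

# convert every .docx in the corpus directory to plain text
docx_files <- list.files('~/R/expertise/corpus/english',
                         pattern = '\\.docx$', full.names = TRUE)
for (f in docx_files) system2('docx2txt.pl', shQuote(f))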
I figure I could probably hack the readDOC reader so that it would use docx2txt or antiword as needed, but this works.
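If anyone wants to try that route, a rough, untested sketch of what a combined reader might look like (it assumes tm passes the file path as elem$uri, and that docx2txt.pl accepts '-' to write the text to stdout):

library(tm)
# hypothetical combined reader: dispatch on extension, shell out to
# the matching converter, and wrap the output as a plain text document
readWord <- function(elem, language, id) {
  txt <- if (grepl('\\.docx$', elem$uri, ignore.case = TRUE)) {
    system2('docx2txt.pl', c(shQuote(elem$uri), '-'), stdout = TRUE)
  } else {
    system2('antiword', shQuote(elem$uri), stdout = TRUE)
  }
  PlainTextDocument(txt, id = id, language = language)
}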