I have a mixed filetype collection of MS Word documents. Some files are *.doc and some are *.docx. I'm learning to use tm
and I've (more or less*) successfully created a corpus composed of the *.doc files using this:
ex_eng <- Corpus(DirSource('~/R/expertise/corpus/english'),
                 readerControl = list(reader = readDOC,
                                      language = 'en_CA',
                                      load = TRUE));
This command does not handle *.docx files. I assume that I need a different reader. From this article, I understand that I could write my own (given a good understanding of the .docx format, which I do not currently have).
The readDOC reader uses antiword to parse *.doc files. Is there a similar application that will parse *.docx files?
Or better still, is there already a standard way of creating a corpus of *.docx files using tm?
* more or less, because although the files go in and are readable, I get this warning for every document:
In readLines(y, encoding = x$Encoding) : incomplete final line found on 'path/to/a/file.doc'
.docx files are zipped XML files. If you execute this:
> uzfil <- unzip(file.choose())
and then pick a .docx file in your directory, you get:
> str(uzfil)
chr [1:13] "./[Content_Types].xml" "./_rels/.rels" "./word/_rels/document.xml.rels" ...
> uzfil
[1] "./[Content_Types].xml" "./_rels/.rels" "./word/_rels/document.xml.rels"
[4] "./word/document.xml" "./word/theme/theme1.xml" "./docProps/thumbnail.jpeg"
[7] "./word/settings.xml" "./word/webSettings.xml" "./word/styles.xml"
[10] "./docProps/core.xml" "./word/numbering.xml" "./word/fontTable.xml"
[13] "./docProps/app.xml"
This will also silently unpack all of those files to your working directory. The "./word/document.xml" file has the words you are looking for, so you can probably read them with one of the XML tools in package XML. I'm guessing you would do something along the lines of:
library(XML)
# unzip() above already extracted the files to disk, so uzfil[4] is a real path
xtext <- xmlTreeParse(uzfil[4], useInternalNodes = TRUE)
Actually, you will probably want to unzip to a temp directory instead and add that path to the file name, "./word/document.xml".
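Once the parse succeeds, a minimal sketch of pulling the visible text out (assuming the standard WordprocessingML namespace; in a .docx the text runs live in w:t elements):

ns <- c(w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main")
# collect the text of every w:t node and glue the runs back together
words <- xpathSApply(xtext, "//w:t", xmlValue, namespaces = ns)
full_text <- paste(words, collapse = " ")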
You may want to use the further steps provided by @GaborGrothendieck in this answer: How to extract xml data from a CrossRef using R?
I ended up using docx2txt to convert the .docx files to text. Then I created a corpus from them like this:
ex_eng <- Corpus(DirSource('~/R/expertise/corpus/english'),
                 readerControl = list(reader = readPlain,
                                      language = 'en_CA',
                                      load = TRUE));
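For the conversion step itself, a rough sketch of scripting it from R (assuming the docx2txt script is on your PATH; the executable name varies by install, and I'm assuming docx2txt.pl here, which writes file.txt next to file.docx by default):

# convert every .docx in the corpus directory to plain text
docx_files <- list.files('~/R/expertise/corpus/english',
                         pattern = '\\.docx$', full.names = TRUE)
for (f in docx_files) system2('docx2txt.pl', shQuote(f))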
I figure I could probably hack the readDOC reader so that it would use docx2txt or antiword as needed, but this works.
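If anyone wants to try that route, a rough, untested sketch of what a combined reader might look like (it assumes tm passes the file path as elem$uri, and that docx2txt.pl accepts '-' to write the text to stdout):

library(tm)
# hypothetical combined reader: dispatch on extension, shell out to
# the matching converter, and wrap the output as a plain text document
readWord <- function(elem, language, id) {
  txt <- if (grepl('\\.docx$', elem$uri, ignore.case = TRUE)) {
    system2('docx2txt.pl', c(shQuote(elem$uri), '-'), stdout = TRUE)
  } else {
    system2('antiword', shQuote(elem$uri), stdout = TRUE)
  }
  PlainTextDocument(txt, id = id, language = language)
}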