I have an issue where .doc and .pdf files are coming out OK but a .docx file is coming out corrupt.
In order to solve that I am trying to debug why the .docx is corrupt.
I learned that the docx format is much stricter with regard to extra characters than either .pdf or .doc. Therefore I have searched the various xml files WITHIN the docx file looking for invalid XML. But I can't find any. It all validates fine.
Could anyone suggest directions for me to investigate now?
UPDATE:
The full listing of files inside the folder is as follows:
/_rels
    .rels
/customXml
    /_rels
        .rels
    item1.xml
    itemProps1.xml
/docProps
    app.xml
    core.xml
/word
    /_rels
        document.xml.rels
    /media
        image1.jpeg
    /theme
        theme1.xml
    document.xml
    fontTable.xml
    numbering.xml
    settings.xml
    styles.xml
    stylesWithEffects.xml
    webSettings.xml
[Content_Types].xml
UPDATE 2:
I should also have mentioned that the cause of the corruption is almost certainly a bad binary file POST on my part.
Why are .docx files corrupted by a binary POST, while .doc and .pdf files are fine?
UPDATE 3:
I have tried the demo versions of various docx repair tools. They all seem to repair the file OK but give no clue as to the cause of the error.
My next step is to compare the contents of the corrupted file with the repaired version.
If anybody knows of a docx repair tool that gives a decent error message I'd appreciate hearing about it. In fact I might post that as a separate question.
UPDATE 4 (2017)
I never solved this problem. I have tried all the tools suggested in the answers below but none of them worked for me.
I have since progressed a little further and found that a block of 0000 is missing when I open the .docx in Sublime Text. More details in the new question here: What could be causing this corruption in .docx files during httpwebrequest?
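For anyone chasing a similar missing block, a byte-level comparison of the file as sent and the file as received can show exactly where the two diverge. The following is only a minimal Python sketch: it assumes both copies are available as bytes, and the function names `first_difference` and `describe_difference` are made up for illustration.

```python
def first_difference(original: bytes, received: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (a, b) in enumerate(zip(original, received)):
        if a != b:
            return offset
    # All shared bytes match, so any difference is a truncation or extension.
    if len(original) != len(received):
        return min(len(original), len(received))
    return None


def describe_difference(original: bytes, received: bytes, context: int = 16) -> str:
    """Build a short report with a hex dump of both files at the first difference."""
    offset = first_difference(original, received)
    if offset is None:
        return "files are identical"
    return (
        f"first difference at offset {offset} (0x{offset:x}); "
        f"sizes {len(original)} vs {len(received)} bytes\n"
        f"original: {original[offset:offset + context].hex(' ')}\n"
        f"received: {received[offset:offset + context].hex(' ')}"
    )
```

Reading both files with `open(path, "rb").read()` and printing `describe_difference(...)` gives an offset to look up in a hex editor. A size mismatch together with an early difference would be consistent with bytes being dropped in transit rather than altered.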
Usually, when there is an error in a particular XML file, Word tells you which file and which line the error is on. So I believe the problem comes from either the zipping of the file or the folder structure.
Here is the folder structure of a Word file:
The .docx format is a zipped file that contains the following folders:
+--docProps
|  +  app.xml
|  \  core.xml
+  res.log
+--word                        // this folder contains most of the files that control the content of the document
|  +  document.xml             // the actual content of the document
|  +  endnotes.xml
|  +  fontTable.xml
|  +  footer1.xml              // contains the elements in the footer of the document
|  +  footnotes.xml
|  +--media                    // this folder contains all images embedded in the document
|  |  \  image1.jpeg
|  +  settings.xml
|  +  styles.xml
|  +  stylesWithEffects.xml
|  +--theme
|  |  \  theme1.xml
|  +  webSettings.xml
|  \--_rels
|     \  document.xml.rels     // this file tells Word where the images are located
+  [Content_Types].xml
\--_rels
   \  .rels
It seems that you only have what is inside the word folder, is that right? If that is not the problem, could you please either send the corrupted .docx or post the folder structure inside your zip?
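To check the zipping side quickly, the container can be tested independently of Word. This is only a rough Python sketch using the standard `zipfile` module; the function name `inspect_docx` and the shape of the returned report are made up for illustration. It lists the entries and reports the first one whose data fails its CRC check.

```python
import io
import zipfile
import zlib


def inspect_docx(data: bytes) -> dict:
    """List the zip entries and name the first entry that fails its integrity check."""
    report = {"is_zip": False, "entries": [], "first_bad_entry": None, "error": None}
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            report["is_zip"] = True
            report["entries"] = zf.namelist()
            # testzip() reads every entry and returns the first name with a
            # bad CRC or header, or None when all entries check out.
            report["first_bad_entry"] = zf.testzip()
    except zipfile.BadZipFile as exc:
        # Typically means the central directory at the end of the file
        # is missing or unreadable, for example after truncation.
        report["error"] = str(exc)
    except zlib.error as exc:
        # The compressed stream of an entry could not be decoded.
        report["error"] = str(exc)
    return report
```

Calling `inspect_docx(open(path, "rb").read())` on the corrupted upload shows whether the archive opens at all and which part, if any, is damaged. If every XML part validates but this check fails, the damage is in the container bytes rather than in the XML.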
I used the "Open XML SDK 2.5 Productivity Tool" (http://www.microsoft.com/en-us/download/details.aspx?id=30425) to find a problem with a broken hyperlink reference.
You have to download/install the SDK first, then the tool. The tool will open and analyze the document for problems.
Many years late, but I found this which actually worked for me. (From https://msdn.microsoft.com/en-us/library/office/bb497334.aspx)
(wordDoc is a WordprocessingDocument.)
using System;
using DocumentFormat.OpenXml.Validation;

try
{
    var validator = new OpenXmlValidator();
    var count = 0;

    // Validate() yields one entry per schema or semantic problem found in the package.
    foreach (var error in validator.Validate(wordDoc))
    {
        count++;
        Console.WriteLine("Error " + count);
        Console.WriteLine("Description: " + error.Description);
        Console.WriteLine("ErrorType: " + error.ErrorType);
        Console.WriteLine("Node: " + error.Node);
        Console.WriteLine("Path: " + error.Path.XPath);
        Console.WriteLine("Part: " + error.Part.Uri);
        Console.WriteLine("-------------------------------------------");
    }

    Console.WriteLine("count={0}", count);
}
catch (Exception ex)
{
    // Report anything thrown while validating the document.
    Console.WriteLine(ex.Message);
}
A web-based docx validator worked for me: http://ucd.eeonline.org/validator/index.php