How can I debug a corrupt docx file?

Posted 2019-03-15 08:59

I have an issue where .doc and .pdf files are coming out OK but a .docx file is coming out corrupt.

To solve that, I am trying to work out why the .docx is corrupt.

I learned that the docx format is much stricter about extra characters than either .pdf or .doc, so I searched the various XML files within the docx file for invalid XML. I can't find any; it all validates fine.
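The per-file check described above can be automated. What follows is only a minimal sketch, assuming Python with its standard library; the function name `find_malformed_parts` is made up for illustration. It tests well-formedness only, not validity against the Open XML schemas, so a clean result here does not rule out other causes.

```python
import zipfile
import xml.etree.ElementTree as ET


def find_malformed_parts(docx):
    """Return (part_name, error_message) pairs for XML parts that fail to parse.

    `docx` may be a file path or a binary file-like object.
    """
    problems = []
    with zipfile.ZipFile(docx) as archive:
        for name in archive.namelist():
            # Only the XML and relationship parts are expected to parse as XML;
            # media such as word/media/image1.jpeg is skipped.
            if not name.lower().endswith((".xml", ".rels")):
                continue
            try:
                ET.fromstring(archive.read(name))
            except (ET.ParseError, zipfile.BadZipFile) as exc:
                # BadZipFile covers a member whose stored CRC does not match.
                problems.append((name, str(exc)))
    return problems
```

An empty list means every XML part parsed, which would match the "it all validates fine" observation and point the search toward the container rather than its contents.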

[Screenshot: the XML files I've been checking]

Could anyone suggest directions for me to investigate now?

UPDATE:

The full listing of files inside the folder is as follows:

/_rels
    .rels

/customXml
    /_rels
        .rels
    item1.xml
    itemProps1.xml

/docProps
    app.xml
    core.xml

/word
    /_rels
        document.xml.rels
    /media
        image1.jpeg
    /theme
        theme1.xml
    document.xml
    fontTable.xml
    numbering.xml
    settings.xml
    styles.xml
    stylesWithEffects.xml
    webSettings.xml

[Content_Types].xml

UPDATE 2:

I should also have mentioned that the reason for the corruption is almost certainly a bad binary file POST on my part.

Why are docx files corrupted by a binary POST, while .doc and .pdf files are fine?
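One possible explanation, which nothing in this thread confirms: a .docx is a ZIP container, and ZIP readers locate the central directory from a record at the end of the file, so truncated or padded bytes can make the whole package unreadable, whereas .doc and .pdf readers may be more tolerant. The sketch below, assuming Python and a made-up helper name `inspect_zip_bytes`, looks at the raw bytes for the two standard ZIP signatures and counts any bytes after the end-of-central-directory record.

```python
import struct

LOCAL_HEADER = b"PK\x03\x04"  # signature at the start of a normal ZIP file
EOCD = b"PK\x05\x06"          # end-of-central-directory signature


def inspect_zip_bytes(data):
    """Report basic container facts about the raw bytes of a .docx (ZIP) file."""
    report = {
        "size": len(data),
        "starts_with_local_header": data.startswith(LOCAL_HEADER),
        "eocd_offset": data.rfind(EOCD),  # -1 if the signature is absent
        "trailing_bytes": None,           # stays None if the record is cut short
    }
    pos = report["eocd_offset"]
    if pos != -1 and pos + 22 <= len(data):
        # The fixed part of the record is 22 bytes; its last two bytes hold
        # the length of an optional comment as a little-endian 16-bit integer.
        (comment_len,) = struct.unpack_from("<H", data, pos + 20)
        report["trailing_bytes"] = len(data) - (pos + 22 + comment_len)
    return report
```

Running it on the file as sent and on the file as received, for example `inspect_zip_bytes(open(path, "rb").read())`, and comparing the two reports would show whether the POST changed the size or the tail of the file.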

UPDATE 3:

I have tried the demo versions of various docx repair tools. They all seem to repair the file OK but give no clue as to the cause of the error.

My next step is to compare the contents of the corrupted file with the repaired version.
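A part-by-part comparison could look roughly like the sketch below. It assumes Python and that both files can still be opened as ZIP archives; the function name `compare_docx_parts` is made up for illustration. It uses the sizes and CRC values recorded in each archive's directory, so it does not re-read the compressed data itself.

```python
import zipfile


def compare_docx_parts(corrupt, repaired):
    """Return {part_name: reason} for parts that differ between two packages."""
    with zipfile.ZipFile(corrupt) as a, zipfile.ZipFile(repaired) as b:
        left = {info.filename: info for info in a.infolist()}
        right = {info.filename: info for info in b.infolist()}
    differences = {}
    for name in sorted(left.keys() | right.keys()):
        if name not in left:
            differences[name] = "only in repaired"
        elif name not in right:
            differences[name] = "only in corrupt"
        elif (left[name].CRC, left[name].file_size) != (
            right[name].CRC,
            right[name].file_size,
        ):
            # Same part name, but the recorded checksum or size differs.
            differences[name] = "content differs"
    return differences
```

If the corrupted file cannot be opened at all, `zipfile.ZipFile` raises `zipfile.BadZipFile`, which by itself suggests the damage is in the container rather than in any single XML part.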

If anybody knows of a docx repair tool that gives a decent error message I'd appreciate hearing about it. In fact I might post that as a separate question.

UPDATE 4 (2017)

I never solved this problem. I have tried all the tools suggested in the answers below but none of them worked for me.

I have since progressed a little further and found a block of 0000 missing when opening the .docx in Sublime Text. More details in the new question here: What could be causing this corruption in .docx files during httpwebrequest?

4 answers
Summer. ? 凉城
Reply #2 · 2019-03-15 09:25

Many years late, but I found this, which actually worked for me (from https://msdn.microsoft.com/en-us/library/office/bb497334.aspx).

Here wordDoc is a WordprocessingDocument.

using System;
using DocumentFormat.OpenXml.Validation;

        try
        {
            // Validate every part of the package against the Open XML schema.
            var validator = new OpenXmlValidator();
            var count = 0;
            foreach (var error in validator.Validate(wordDoc))
            {
                count++;
                Console.WriteLine("Error " + count);
                Console.WriteLine("Description: " + error.Description);
                Console.WriteLine("ErrorType: " + error.ErrorType);
                Console.WriteLine("Node: " + error.Node);
                // Path and Part are read with ?. in case an error carries no location.
                Console.WriteLine("Path: " + error.Path?.XPath);
                Console.WriteLine("Part: " + error.Part?.Uri);
                Console.WriteLine("-------------------------------------------");
            }

            Console.WriteLine("count={0}", count);
        }
        catch (Exception ex)
        {
            // Reached if validation itself throws.
            Console.WriteLine(ex.Message);
        }
Explosion°爆炸
Reply #3 · 2019-03-15 09:30

I used the "Open XML SDK 2.5 Productivity Tool" (http://www.microsoft.com/en-us/download/details.aspx?id=30425) to find a problem with a broken hyperlink reference.

You have to download/install the SDK first, then the tool. The tool will open and analyze the document for problems.

走好不送
Reply #4 · 2019-03-15 09:37

A web-based docx validator worked for me: http://ucd.eeonline.org/validator/index.php

叼着烟拽天下
Reply #5 · 2019-03-15 09:40

Usually, when there is an error in a particular XML file, Word tells you which line of which file the error is on. So I believe the problem comes from either the zipping of the file or the folder structure.

The .docx format is a zipped file. Here is the folder structure it contains:

+--docProps
|  +  app.xml
|  \  core.xml
+  res.log
+--word //This folder contains most of the files that control the content of the document
|  +  document.xml //The actual content of the document
|  +  endnotes.xml
|  +  fontTable.xml
|  +  footer1.xml //Contains the elements in the footer of the document
|  +  footnotes.xml
|  +--media //This folder contains all images embedded in the Word document
|  |  \  image1.jpeg
|  +  settings.xml
|  +  styles.xml
|  +  stylesWithEffects.xml
|  +--theme
|  |  \  theme1.xml
|  +  webSettings.xml
|  \--_rels
|     \  document.xml.rels //This file tells Word where the images are located
+  [Content_Types].xml
\--_rels
   \  .rels

It seems that you only have the contents of the word folder; is that right? If that is not the problem, could you either send the corrupted .docx or post the folder structure inside your zip?
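Both suspects named above, the zipping and the folder structure, can be checked in a few lines. This is a sketch only, assuming Python; `check_package` and `EXPECTED_PARTS` are made-up names, and the list of expected parts is an assumption taken from the listings in this thread rather than from the specification.

```python
import zipfile

# Assumed minimum set of parts, taken from the listings in this thread.
EXPECTED_PARTS = ("[Content_Types].xml", "_rels/.rels", "word/document.xml")


def check_package(docx):
    """Return a list of human-readable problems found in the ZIP container."""
    try:
        with zipfile.ZipFile(docx) as archive:
            names = set(archive.namelist())
            problems = [
                f"missing part: {part}"
                for part in EXPECTED_PARTS
                if part not in names
            ]
            # testzip() re-reads every member and returns the first name whose
            # CRC check fails, or None if all members are intact.
            bad = archive.testzip()
            if bad is not None:
                problems.append(f"CRC mismatch in: {bad}")
            return problems
    except zipfile.BadZipFile as exc:
        return [f"not a readable ZIP archive: {exc}"]
```

A package zipped from the wrong level, for example with only the contents of the word folder or with everything nested under an extra parent folder, would show up here as missing parts.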
