I need to compare two Office documents, in this case two Word documents, and produce a diff, somewhat similar to what is shown in SVN. It doesn't need to go that far, but it should at least highlight the differences.
I tried using the Office COM DLL and got this far:
object fileToOpen = (object)@"D:\doc1.docx";
string fileToCompare = @"D:\doc2.docx";
WRD.Application WA = new WRD.Application();
Document wordDoc = null;
wordDoc = WA.Documents.Open(ref fileToOpen, Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing);
wordDoc.Compare(fileToCompare, Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing);
Any tips on how to proceed further? This will be a web application with a lot of hits. Is using the Office COM object the right way to go, or are there other options I can look at?
I agree with Joseph about diffing the string. I would also recommend a purpose-built diffing engine (several are listed in the question "Any decent text diff/merge engine for .NET?"), which can help you avoid some of the usual pitfalls in diffing.
You should use the Document class to compare the files and open the result in a Word document.
using OfficeWord = Microsoft.Office.Interop.Word;
object fileToOpen = (object)@"D:\doc1.docx";
string fileToCompare = @"D:\doc2.docx";
var app = Global.OfficeFile.WordApp;
object missing = Type.Missing; // stands in for the optional COM arguments we don't set
object readOnly = false;
object AddToRecent = false;
object Visible = false;
OfficeWord.Document docZero = app.Documents.Open(fileToOpen, ref missing, ref readOnly, ref AddToRecent, Visible: ref Visible);
docZero.Final = false;
docZero.TrackRevisions = true;
docZero.ShowRevisions = true;
docZero.PrintRevisions = true;
// WdCompareTarget.wdCompareTargetNew puts the comparison result in a new document; use another WdCompareTarget value to change where Word puts it
docZero.Compare(fileToCompare, missing, OfficeWord.WdCompareTarget.wdCompareTargetNew, true, false, false, false, false);
My requirements were that I had to use a .NET library, and I wanted to avoid working on actual files and work with streams instead.
ZipArchive is in System.IO.Compression.
What I did, and it worked out quite nicely, was use ZipArchive from .NET and compare the contents of the entries while skipping the .rels files, because they seem to be randomly generated each time a file is created. Here's my snippet:
// Requires: using System.IO; using System.IO.Compression;
private static bool AreWordFilesSame(byte[] wordA, byte[] wordB)
{
    using (var streamA = new MemoryStream(wordA))
    using (var streamB = new MemoryStream(wordB))
    using (var zipA = new ZipArchive(streamA))
    using (var zipB = new ZipArchive(streamB))
    {
        // A different number of parts means the documents cannot be identical
        if (zipA.Entries.Count != zipB.Entries.Count)
        {
            return false;
        }
        for (int i = 0; i < zipA.Entries.Count; ++i)
        {
            // Entries are compared in archive order, so both files must list their parts in the same order
            if (zipA.Entries[i].FullName != zipB.Entries[i].FullName)
            {
                return false;
            }
            if (zipA.Entries[i].FullName.EndsWith(".rels")) // relationship parts seem to contain autogenerated ids
            {
                continue;
            }
            using (var readerA = new StreamReader(zipA.Entries[i].Open()))
            using (var readerB = new StreamReader(zipB.Entries[i].Open()))
            {
                var textA = readerA.ReadToEnd();
                var textB = readerB.ReadToEnd();
                // An empty entry is also treated as a difference
                if (textA != textB || textA.Length == 0)
                {
                    return false;
                }
            }
        }
        return true;
    }
}
You should really be extracting the document text into a string and diffing that.
You only care about the textual changes and not the formatting, right?
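As a rough sketch of the "extract the text" step, assuming only .docx input and using nothing beyond the .NET base class library: a .docx file is a zip archive, and the body text lives in word/document.xml as w:t runs inside w:p paragraphs. The class and method names below (DocxText, ExtractParagraphs) are made up for illustration. It ignores headers, footers, footnotes, tabs and line breaks, and a paragraph nested inside another one (e.g. in a text box) would have its text counted twice, so treat it as a starting point only.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class DocxText
{
    // Main WordprocessingML namespace used by word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one string per paragraph (w:p), built by concatenating its text runs (w:t).
    public static List<string> ExtractParagraphs(byte[] docx)
    {
        using (var stream = new MemoryStream(docx))
        using (var zip = new ZipArchive(stream, ZipArchiveMode.Read))
        {
            var entry = zip.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException("word/document.xml not found; is this a .docx file?");
            }

            using (var xml = entry.Open())
            {
                var doc = XDocument.Load(xml);
                return doc.Descendants(W + "p")
                          .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
                          .ToList();
            }
        }
    }
}
```

The resulting paragraph lists (or the paragraphs joined into one string) can then be handed to whatever diff engine you pick.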
For a solution that runs on a server, or without an installation of Word and the COM tools, you could use the WmlComparer component of Open-Xml-PowerTools.
The documentation is a bit limited, but here's an example usage:
var expected = File.ReadAllBytes(@"c:\expected.docx");
var actual = File.ReadAllBytes(@"c:\result.docx");
var expectedDocument = new WmlDocument("expected.docx", expected);
var actualDocument = new WmlDocument("result.docx", actual);
var comparisonSettings = new WmlComparerSettings();
var comparisonResults = WmlComparer.Compare(expectedDocument, actualDocument, comparisonSettings);
var revisions = WmlComparer.GetRevisions(comparisonResults, comparisonSettings);
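This gives you a comparison document with the changes marked as tracked revisions (comparisonResults) and a list of the individual differences between the two documents (revisions).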
To compare Word documents, you need:
- A library to read Word documents, e.g. to get paragraphs, text, tables etc. from a Word file. You can try Office Interop, OpenXML or Aspose.Words for .NET.
- An algorithm or library to do the actual comparison on the text retrieved from both Word documents. You can write your own or use a library like DiffMatchPatch.
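To make the second point concrete, here is a minimal sketch of the "write your own" option: a paragraph-level diff based on the longest common subsequence, assuming the text of each document has already been extracted into a list of paragraph strings by whichever library you chose. The names ParagraphDiff and Diff are hypothetical, and this is not how DiffMatchPatch works internally. It uses O(n·m) memory, so a dedicated library is likely the better choice for large documents.

```csharp
using System;
using System.Collections.Generic;

public static class ParagraphDiff
{
    // Returns diff lines prefixed with "  " (unchanged), "- " (only in a) or "+ " (only in b).
    public static List<string> Diff(IList<string> a, IList<string> b)
    {
        // lcs[i, j] = length of the longest common subsequence of a[i..] and b[j..].
        var lcs = new int[a.Count + 1, b.Count + 1];
        for (int i = a.Count - 1; i >= 0; i--)
        {
            for (int j = b.Count - 1; j >= 0; j--)
            {
                lcs[i, j] = a[i] == b[j]
                    ? lcs[i + 1, j + 1] + 1
                    : Math.Max(lcs[i + 1, j], lcs[i, j + 1]);
            }
        }

        // Walk both lists, following the table to keep as many paragraphs aligned as possible.
        var result = new List<string>();
        int x = 0, y = 0;
        while (x < a.Count && y < b.Count)
        {
            if (a[x] == b[y]) { result.Add("  " + a[x]); x++; y++; }
            else if (lcs[x + 1, y] >= lcs[x, y + 1]) { result.Add("- " + a[x]); x++; }
            else { result.Add("+ " + b[y]); y++; }
        }
        while (x < a.Count) { result.Add("- " + a[x]); x++; }
        while (y < b.Count) { result.Add("+ " + b[y]); y++; }
        return result;
    }
}
```

For example, Diff(new[] { "Hello", "World", "End" }, new[] { "Hello", "There", "End" }) keeps "Hello" and "End" as unchanged and reports "World" as removed and "There" as added.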
This question is old; there are now more solutions available, like GroupDocs Compare.
Document Comparison by Aspose.Words for .NET is an open source showcase project that uses Aspose.Words and DiffMatchPatch for comparison.
I work at Aspose as a Developer Evangelist.