I'm storing Word and Excel documents inside a SQL Server database table. These documents are pulled from the database with my C# application and are put into byte[]
arrays.
I want to replace certain strings found in the Word/Excel documents. What is the best way to do this with the byte array available?
I was looking at something like this:
string fileString = System.Text.Encoding.UTF8.GetString(image.ImageObject);
fileString = fileString.Replace("FROM", "TO");
byte[] newImageObject = System.Text.Encoding.UTF8.GetBytes(fileString);
I would recommend you using the Open XML SDK.
With the library, you can do the following to replace text from a Word document, considering that documentByteArray
is your document byte content taken from database:
using (MemoryStream mem = new MemoryStream())
{
mem.Write(documentByteArray, 0, (int)documentByteArray.Length);
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(document, true))
{
string docText = null;
using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
{
docText = sr.ReadToEnd();
}
Regex regexText = new Regex("Hello world!");
docText = regexText.Replace(docText, "Hi Everyone!");
using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
{
sw.Write(docText);
}
}
}
The example above was taken from here. You can do similarly with Excel spreadsheets.
I believe you will have to save the bytes as a Word/Excel file and use office automation tools to make the changes.
If you go changing bytes willy-nilly in binary files, you could mess up offsets, checksums, CRC checks, trigger anti-virus software, etc.
Your approach is likely to fail.
If you are talking about .doc and .xls, these file formats are binary, making it most likely that the byte stream contains byte sequences that are not valid UTF-8.
Even if that's not the case, replacing strings of different lengths will make offsets and length fields invalid, thus causing the documents to fail when opening them.
If, on the other hand, you are talking about .docx and .xslx, these files are in fact zipped XML files, which again cannot be simply searched&replaced: just consider that the find string matches an XML element or attribute name (or a part thereof). Again, the replace operation cannot operate on the whole file.