Parse MathType MTEF data from OLE binary string

Posted 2020-07-18 09:34

Question:

I need to convert MathType equations in MS Word 2003 (or earlier) documents to MathML so that they render nicely on the web. MathType's built-in "Publish to MathPage" function does the job very well, but I want to integrate the equation conversion into my C# application. I couldn't find any API reference indicating that the MathType SDK exposes the MathPage export interface, so I need to find a way to convert individual equations myself.

My current procedure is to convert the MS Word 2003 (or earlier) documents to the Open XML format (docx). After the conversion, I can see that the embedded MathType OLE object is saved as binary data inside the docx package. The next step is to decode the MTEF data from that embedded object, so I tried to extract it by following the official documentation on the MathType MTEF header.

The base64 binary string representing the embedded object created by MathType was extracted from a test MS Word DOCX file.

The MTEF header definition:

MTEF data is saved as the native data format of the object. Whenever an equation object is to be written to an OLE "stream", a 28-byte header is written, followed by the MTEF data. The C struct for this header is as follows:

struct EQNOLEFILEHDR {
    WORD    cbHdr;     // length of header, sizeof(EQNOLEFILEHDR) = 28 bytes
    DWORD   version;   // hiword = 2, loword = 0
    WORD    cf;        // clipboard format ("MathType EF")
    DWORD   cbObject;  // length of MTEF data following this header in bytes
    DWORD   reserved1; // not used
    DWORD   reserved2; // not used
    DWORD   reserved3; // not used
    DWORD   reserved4; // not used
};

The cf member is the return value of a call to the Windows API function RegisterClipboardFormat("MathType EF").

I then translated it into a C# version:

[StructLayout(LayoutKind.Sequential, Pack=1)]
struct EQNOLEFILEHDR
{
    public UInt16 cbHdr;
    public UInt32 version;
    public UInt16 format;
    public UInt32 size;
    public UInt32 reserved1;
    public UInt32 reserved2;
    public UInt32 reserved3;
    public UInt32 reserved4;
}
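
As a cross-check on the 28-byte layout described above, the header fields can also be read one at a time with BinaryReader, which always reads little-endian values. This is only a minimal sketch: the EqnHeaderReader class, its Header type, and the TryRead method are hypothetical names made up for illustration and are not part of the MathType SDK. The only validation it performs is the documented cbHdr == 28 check.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of the MathType SDK): reads the 28-byte
// EQNOLEFILEHDR field by field instead of marshalling a struct.
public static class EqnHeaderReader
{
    public const int HeaderSize = 28;

    public sealed class Header
    {
        public ushort CbHdr;
        public uint Version;
        public ushort Cf;
        public uint CbObject;
    }

    // Returns null when the bytes at 'offset' do not look like a header.
    public static Header TryRead(byte[] data, int offset)
    {
        if (data == null || offset < 0 || data.Length - offset < HeaderSize)
            return null;

        using (var reader = new BinaryReader(new MemoryStream(data, offset, HeaderSize)))
        {
            var h = new Header
            {
                CbHdr = reader.ReadUInt16(),    // documented as 28
                Version = reader.ReadUInt32(),  // hiword = 2, loword = 0
                Cf = reader.ReadUInt16(),       // clipboard format id
                CbObject = reader.ReadUInt32()  // length of the MTEF data in bytes
            };
            // The four reserved DWORDs are documented as unused, so they are skipped.
            return h.CbHdr == HeaderSize ? h : null;
        }
    }
}
```

If TryRead returns null for the first bytes of the buffer, the data at that position is probably not the header at all, which is a more useful signal than a struct filled with arbitrary values.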

With the header struct ready, the following code tries to fill it from the embedded object's binary data.

foreach (EmbeddedObjectPart eop in wordDoc.MainDocumentPart.EmbeddedObjectParts)
{
    // Read the whole embedded object part into memory.
    byte[] buffer;
    using (Stream stream = eop.GetStream())
    using (MemoryStream ms = new MemoryStream())
    {
        stream.CopyTo(ms);
        buffer = ms.ToArray();
    }

    // Pin the buffer and interpret its first 28 bytes as the header.
    GCHandle hdl = GCHandle.Alloc(buffer, GCHandleType.Pinned);
    try
    {
        EQNOLEFILEHDR header = (EQNOLEFILEHDR)Marshal.PtrToStructure(
            hdl.AddrOfPinnedObject(), typeof(EQNOLEFILEHDR));
    }
    finally
    {
        hdl.Free(); // always release the pinned handle
    }
}

However, the data filled into the header struct isn't correct, which makes me think this is not the right approach to parsing the MTEF data from the embedded object binary string in the DOCX file.
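
One possible explanation, stated here as an assumption rather than something confirmed in this thread: the embedded object part in a DOCX is usually an OLE compound file, in which case the EQNOLEFILEHDR would sit at the start of a stream inside that container (commonly named "Equation Native") rather than at byte 0 of the part. A quick way to test this is to compare the first eight bytes with the compound file signature. The OleSignature class and its method name below are hypothetical, for illustration only.

```csharp
using System;

// Hypothetical helper: checks whether a buffer starts with the
// OLE compound file signature D0 CF 11 E0 A1 B1 1A E1.
public static class OleSignature
{
    private static readonly byte[] Magic =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    public static bool LooksLikeCompoundFile(byte[] data)
    {
        if (data == null || data.Length < Magic.Length)
            return false;

        for (int i = 0; i < Magic.Length; i++)
        {
            if (data[i] != Magic[i])
                return false;
        }
        return true;
    }
}
```

If the check returns true, reading the first two bytes as cbHdr yields 0xCFD0 (53200) instead of 28. In that case the container would have to be opened with a structured-storage reader first, and the header parsed from the stream it contains.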

I have also looked at the sample .NET code in the MathType SDK download and found that IDataObject is used to hold the MathType information and conversion procedures. So another approach I tried was to use BinaryFormatter to see whether it could deserialize the binary string into an IDataObject, by calling BinaryFormatter.Deserialize(stream). That doesn't work either; it throws the exception "Binary stream '0' does not contain a valid BinaryHeader".

Is there anything wrong with the methods I used to parse the MTEF data?

Answer 1:

Kata, you should have received my email reply, but for anyone else interested, we have a sample which is modified from our SDK that we'd be happy to send to anyone who needs it. For anyone using it, it probably won't make much sense if you haven't downloaded the SDK. Please let me know if you'd like to give it a try.

Bob Mathews
Design Science