
Is there a “binary dump” or “get binary represent

Posted 2020-07-27 04:02

Question:

I need to access the internal binary representation of a loaded XML DOM. There are some dump functions, but I do not see anything like a "binary buffer" (there are only "XML buffers").

My ultimate goal is to compare the same document byte by byte, before and after some black-box procedure, directly through its binary (current and cached) representations, without converting it to an XML-text representation. So, the question:

Is there a binary representation (of the in-memory structures) in LibXML2 that can be dumped and then compared with the current representation?

I only need to check whether the current and dumped DOMs are equivalent.


Details

This is not the problem of comparing two distinct DOM objects but something easier: IDs etc. do not change, and no canonical representation is needed (!). I only need access to the internal representation, because that is much faster than converting to text.

Between "before" and "after" there is a black-box procedure, e.g. an XSLT identity transform that affects (or does not affect) some nodes or attributes.

Alternative solutions

  1. Develop a C function for LibXML2 that compares the two trees node by node and returns false if they differ: during the tree traversal, if the tree structure or some nodeValue changes, the algorithm stops the comparison (returning false).

  2. Not ideal, but useful to other algorithms: access (in LibXML2) the total number of nodes, or the total length or size, or an MD5 or SHA-1 digest. This would only optimize the frequent cases (for my application) where the comparison returns false, avoiding the complete comparison procedure.


NOTES

Related questions

  • How to check if a DomDocument was changed with a simple and fast comparison?
  • C byte-by-byte comparison
  • libxml xmlNodePtr to raw xml string?

Warning for readers using the answers' solutions

The problem is about comparing "before" with "after" a black-box operation, but there are two kinds of black boxes here:

  • Well-known and controllable ones, like XSLT transforms or the use of a known library. You must know that your black boxes will not change attribute order or ID content, denormalize spaces, etc.
  • Completely free ones, like the use of an external editor (e.g. an online editor changing an XHTML document), where user and software can do anything.

I will use a solution in the context of a "well-known" black box, so my comments in the "Details" section above are valid.

In the context of "completely free" black boxes, you cannot use a "comparison of binary dumps", because only a canonical representation (C14N) is valid for comparison. To compare by C14N criteria, only the "Alternative solutions" (listed above) are possible. For alternative 1 you must, among other things, sort each set of attribute nodes before comparing. For alternative 2 (also discussed here), you must generate the C14N dumps.


PS: of course, the use of C14N criteria is subjective and depends on the application: if, e.g., "changed attribute order" is a valid/important change for your application, then a comparison that detects it is valid (!).

Answer 1:

Here are the relevant libxml2 methods:

There is a base64 encoding method:

Function: xmlTextWriterWriteBase64

int xmlTextWriterWriteBase64    (xmlTextWriterPtr writer,
                                 const char * data,
                                 int start,
                                 int len)

Write base64-encoded XML text.
writer: the xmlTextWriterPtr
data:   binary data
start:  the position within the data of the first byte to encode
len:    the number of bytes to encode
Returns:    the bytes written (may be 0 because of buffering) or -1 in case of error

and a BinHex encoding method:

Function: xmlTextWriterWriteBinHex

int xmlTextWriterWriteBinHex    (xmlTextWriterPtr writer,
                                 const char * data,
                                 int start,
                                 int len)

Write BinHex-encoded XML text.
writer: the xmlTextWriterPtr
data:   binary data
start:  the position within the data of the first byte to encode
len:    the number of bytes to encode
Returns:    the bytes written (may be 0 because of buffering) or -1 in case of error

References

  • Module xmlwriter from libxml2

  • ChangeLog last entries of libxml2

  • The XML C parser and toolkit of Gnome: API Alphabetic Index A-B for libxml2

  • libxml Encodings Support

  • binhex.py