Can XQuery regex match a null character?

Posted 2019-02-27 23:52

I'd like to remove all NULL characters from the string. I know that the right regex match should be \x00 and I've tried the following XQuery:

replace($message, '\x00', '')

It results in the error:

exerr:ERROR Conversion from XPath2 to Java regular expression syntax failed: Error at character 1 in regular expression \x00: invalid escape sequence

Is there any quick solution or workaround for this issue? I use eXist-db 2.2.

2 answers
女痞
#2 · 2019-02-28 00:11

Basically, the answer is that there cannot be any NUL (x00) characters in the string. XML, and therefore the XDM data model, does not allow them. So if they appear in your input, you're already outside the scope of the standards.

可以哭但决不认输i
#3 · 2019-02-28 00:23

The short version: you can't, at least not within the boundaries of the XQuery and XML specifications. There may be an eXist-db-proprietary method I am not aware of (something like natively interfacing with the Java regular expression functions from XQuery, which seems to be possible in eXist-db), but I would not consider this a "quick solution or workaround".

The XPath and XQuery Functions and Operators 3.0 specification, which also defines the regular expression syntax for XQuery 3.0, specifies no way of escaping characters by their Unicode code point. The \x00 syntax is specific to some regular expression implementations. regular-expressions.info confirms this:

XML regular expressions don't have any tokens like \xFF or \uFFFF to match particular (non-printable) characters. You have to add them as literal characters to your regex. If you are entering the regex into an XML file using a plain text editor, then you can use the &#xFF; XML syntax. Otherwise, you'll need to paste in the characters from a character map.
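To illustrate the difference between regex flavors: java.util.regex does accept the \xhh escape, so the pattern that the XPath-to-Java conversion rejects works when it is applied in plain Java, for instance before the text is handed to the database. This is only a minimal sketch, assuming the raw text is available as a Java String; the class and method names are hypothetical and not part of eXist-db.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not an eXist-db API.
final class NulStripper {
    // Java's regex flavor supports \xhh hexadecimal escapes,
    // unlike the XPath/XQuery regex flavor.
    private static final Pattern NUL = Pattern.compile("\\x00");

    /** Returns the input with every U+0000 character removed. */
    static String stripNul(String raw) {
        return NUL.matcher(raw).replaceAll("");
    }

    public static void main(String[] args) {
        // '\u0000' is the NUL character written as a Java Unicode escape.
        String raw = "abc" + '\u0000' + "def" + '\u0000';
        System.out.println(stripNul(raw));
    }
}
```

Cleaning the data at this stage keeps the NUL characters from ever reaching XQuery, where they cannot be represented in a conforming way.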

Considering this, there might be two options:

  1. Using an XML character reference to denote the null byte. This is not possible either, because the XML specification excludes the null character from the set of allowed characters. Extensible Markup Language (XML) 1.0 (Fifth Edition) defines character references as:

    CharRef    ::=      '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
    

    With the additional restriction of allowed characters in the same specification:

    Char       ::=      #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
    

    XML 1.1 extends this definition to include the control characters, all of them except the null byte:

    Char       ::=      [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
    

    Finally, XQuery refers to the same specification for its set of allowed characters:

    Char       ::=      [http://www.w3.org/TR/REC-xml#NT-Char]
    
  2. Directly including the null byte in the XQuery document. Apart from issues in practice (including null bytes in files will often result in unexpected issues of various kinds), the same limitations to characters as defined above apply (well-formed XML documents must only consist of characters as defined above):

    document       ::=      ( prolog element Misc* ) - ( Char* RestrictedChar Char* ) 
    

    There is an extended discussion of this in Why are “control” characters illegal in XML 1.0?
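The two Char productions quoted above can be written out as simple range checks, which makes it easy to see that code point 0 falls outside both. The following is a rough sketch, assuming the input is filtered in Java before it is stored or queried; the class and method names are hypothetical.

```java
// Hypothetical helper for illustration; names are not from any library.
final class XmlChars {
    /** True if the code point matches the XML 1.0 (Fifth Edition) Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** True if the code point matches the XML 1.1 Char production. */
    static boolean isXml11Char(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops every code point that is not an XML 1.0 Char, including U+0000. */
    static String keepXml10Chars(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        raw.codePoints()
           .filter(XmlChars::isXml10Char)
           .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

Note that XML 1.1 admits U+0001 while XML 1.0 does not, but neither version admits U+0000, which is why no character reference or literal can carry it into an XQuery string.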
