Can XQuery regex match a null character?

I'd like to remove all NULL characters from the string. I know that the right regex match should be \x00 and I've tried the following XQuery:

replace($message, '\x00', '')

It results in the error:

exerr:ERROR Conversion from XPath2 to Java regular expression syntax failed: Error at character 1 in regular expression \x00: invalid escape sequence

Is there any quick solution or workaround for this issue? I use eXist-db 2.2.

Tags: regex xml xquery exist-db

2 answers

女痞

#2 · 2019-02-28 00:11

Basically, the answer is that there cannot be any NUL (U+0000) characters in the string. XML, and therefore XDM (the XQuery and XPath Data Model), does not allow them. So if they appear in your input, you're already outside the scope of the standards.
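If the raw data passes through Java code you control before it reaches eXist-db (an assumption, since the question does not say how $message is produced), the NUL characters can be dropped at that stage. Java's regex dialect does accept \x00, unlike the XPath/XQuery one. A minimal sketch, where the class and method names are hypothetical:

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of eXist-db or any XQuery API.
class NulStripper {
    // Java's java.util.regex dialect understands the \xhh escape.
    private static final Pattern NUL = Pattern.compile("\\x00");

    /** Returns the input with every U+0000 character removed. */
    static String stripNul(String raw) {
        return NUL.matcher(raw).replaceAll("");
    }

    public static void main(String[] args) {
        // Made-up input with one embedded NUL character.
        String raw = "abc" + (char) 0 + "def";
        String clean = stripNul(raw);
        System.out.println(clean.length()); // prints 6
    }
}
```

The cleaned string can then be handed to the XML/XQuery layer, which never sees the illegal character.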


可以哭但决不认输i

#3 · 2019-02-28 00:23

The short version: you can't, at least not within the boundaries of the XQuery and XML specifications. There may be an eXist-db-proprietary method I am not aware of (something like natively interfacing with the Java regular expression functions from XQuery, which seems to be possible in eXist-db), but I would not consider this a "quick solution or workaround".

Looking through the XPath and XQuery Functions and Operators 3.0 specification, which also contains the definition of regular expressions for XQuery 3.0, there is no specified way of escaping characters by their Unicode code point. The \x00 syntax is specific to some regular expression implementations. regular-expressions.info confirms this:

XML regular expressions don't have any tokens like \xFF or \uFFFF to match particular (non-printable) characters. You have to add them as literal characters to your regex. If you are entering the regex into an XML file using a plain text editor, then you can use the &#xFF; XML syntax. Otherwise, you'll need to paste in the characters from a character map.

Considering this, there might be two options:

Using an XML character reference to denote the null byte. This is not possible either, because the XML specification rules it out. Extensible Markup Language (XML) 1.0 (Fifth Edition) defines character references as:
```
CharRef    ::=      '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
```
The same specification additionally restricts which characters are allowed:
```
Char       ::=      #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
```
XML 1.1 extends this definition to cover control characters, all of them except the null byte:
```
Char       ::=      [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
```
Finally, XQuery relies on the same specification for its allowed characters:
```
Char       ::=      [http://www.w3.org/TR/REC-xml#NT-Char]
```
Directly including the null byte in the XQuery document. Apart from practical problems (null bytes in files often cause unexpected issues of various kinds), the same character restrictions as above apply, since well-formed XML documents may only consist of the characters defined above:
```
document       ::=      ( prolog element Misc* ) - ( Char* RestrictedChar Char* ) 
```
There is an extended discussion of this in Why are “control” characters illegal in XML 1.0?
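The XML 1.0 Char production quoted above translates directly into a range check. The following is only a sketch, assuming the input can be cleaned in Java before it reaches the XML/XQuery layer. The class and method names are made up for illustration and are not an eXist-db API:

```java
// Hypothetical helper mirroring the XML 1.0 (Fifth Edition) Char production.
class Xml10Chars {
    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops every code point that XML 1.0 does not allow, including NUL. */
    static String keepLegal(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        raw.codePoints()                       // iterate by code point, not by char
           .filter(Xml10Chars::isXml10Char)    // keep only legal XML 1.0 characters
           .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up input: a NUL and a U+0001 control character around a legal tab.
        String raw = "a" + (char) 0 + "b" + (char) 1 + "\tc";
        System.out.println(keepLegal(raw).length()); // prints 4
    }
}
```

Filtering against the whole production, rather than NUL alone, also removes the other control characters that XML 1.0 rejects, so the result is safe to store or query.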

