How is the regular expression OR operator evaluated?

In T-SQL I have generated a UNIQUEIDENTIFIER using the NEWID() function. For example:

723952A7-96C6-421F-961F-80E66A4F29D2

Then, all dashes (-) are removed and it looks like this:

723952A796C6421F961F80E66A4F29D2

Now, I need to turn the string above back into a valid UNIQUEIDENTIFIER in the format xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx by reinserting the dashes.

To achieve this, I am using a SQL CLR implementation of the C# RegexMatches function with the regular expression ^.{8}|.{12}$|.{4}, which gives me this:

SELECT *
FROM [dbo].[RegexMatches] ('723952A796C6421F961F80E66A4F29D2', '^.{8}|.{12}$|.{4}')

Using the above, I can easily rebuild a correct UNIQUEIDENTIFIER, but I am wondering how the OR operator is evaluated in the regular expression. For example, the following does not work:

SELECT *
FROM [dbo].[RegexMatches] ('723952A796C6421F961F80E66A4F29D2', '^.{8}|.{4}|.{12}$')

Is it guaranteed that the first regular expression will first match the start and the end of the string, then the other values, and that it always returns the matches in this order? (I will have issues if, for example, 96C6 is matched after 421F.)

Tags: c# .net sql-server regex sql-server-2012

2 Answers

劳资没心，怎么记你

#2 · 2020-04-05 10:40

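If you are interested in what happens when you use the | alternation operator, the answer is simple: the regex engine scans the input string from left to right, and at each position it tries the alternatives from left to right, taking the first one that matches. After a match, the search resumes where that match ended.

Taking your pattern as an example, ^.{8}|.{12}$|.{4} starts at the left of the input string and tries ^.{8}, the first 8 characters. That succeeds, so it is the first match. At each following position, ^.{8} fails because ^ only matches at the start of the string, so .{12}$ is tried next. It fails as long as more than 12 characters remain, and .{4} takes a 4-character match instead. Once exactly 12 characters are left, .{12}$ is tried before .{4} and matches the end of the string. The matches are therefore returned in the order in which they occur in the string (8, 4, 4, 4 and 12 characters), so 96C6 always comes before 421F.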
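
The walk-through above can be checked with a small sketch. It uses Python's re module rather than the SQL CLR RegexMatches function from the question, on the assumption that both engines treat this pattern the same way (leftmost position first, then the first alternative that succeeds); the variable names are only illustrative.

```python
import re

# The dash-less GUID from the question.
s = "723952A796C6421F961F80E66A4F29D2"

# Same alternation as in the question: start anchor, end anchor, then 4-char chunks.
pattern = re.compile(r"^.{8}|.{12}$|.{4}")

# finditer yields non-overlapping matches in the order they occur in the string.
matches = [(m.start(), m.group()) for m in pattern.finditer(s)]
for start, text in matches:
    print(start, text)

# Joining the matches in that order restores the dashed form.
guid = "-".join(text for _, text in matches)
print(guid)
```

The start offsets printed are strictly increasing, which is why the pieces can simply be joined in the order they are returned.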

Next, you have ^.{8}|.{4}|.{12}$. The alternatives are again tried from left to right: the first 8 characters are matched first, but after that only 4-character sequences are matched. .{12}$ never fires, because .{4} comes before it in the alternation and succeeds at every position where .{12}$ could match.
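
As a rough illustration of the reordered pattern, here is a sketch in Python's re module (an assumption: it stands in for the .NET engine behind RegexMatches, which should resolve this alternation the same way). The names are hypothetical.

```python
import re

s = "723952A796C6421F961F80E66A4F29D2"

# Reordered alternation: .{4} now comes before .{12}$.
pattern = re.compile(r"^.{8}|.{4}|.{12}$")

# After the first 8 characters, .{4} wins at every position,
# so the final 12 characters are split into three 4-character matches.
parts = [m.group() for m in pattern.finditer(s)]
print(parts)
print(len(parts))
```

Seven matches come back instead of five, and none of them is 12 characters long.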

家丑人穷心不美

#3 · 2020-04-05 10:48

Your regex ^.{8}|.{12}$|.{4} reads as:

At the start of the string, any character except \n, exactly 8 times

OR any character except \n, exactly 12 times, at the end of the string

OR any character except \n, exactly 4 times, anywhere in the string

This means that any string of 4 or more characters produces a match, because such a string always contains 4 characters in a row.

1 [false]

12 [false]

123 [false]

1234 [true]

12345 [true]

123456 [true]

1234567 [true]

12345678 [true]

123456789 [true]

1234567890 [true]

12345678901 [true]

123456789012 [true]

You might be looking for:

^.{8}$|^.{12}$|^.{4}$

Which gives you:

1 [false]

12 [false]

123 [false]

1234 [true]

12345 [false]

123456 [false]

1234567 [false]

12345678 [true]

123456789 [false]

1234567890 [false]

12345678901 [false]

123456789012 [true]
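
The two result lists can be reproduced with a short sketch. It uses Python's re.search as a stand-in for an "is there a match" test (an assumption, since the answer does not say which tool produced the lists), and the names unanchored and anchored are only illustrative.

```python
import re

# Pattern from the question: the .{4} alternative may match anywhere.
unanchored = re.compile(r"^.{8}|.{12}$|.{4}")
# Suggested pattern: every alternative must span the whole string.
anchored = re.compile(r"^.{8}$|^.{12}$|^.{4}$")

digits = "123456789012"
unanchored_lengths = []
anchored_lengths = []
for n in range(1, len(digits) + 1):
    candidate = digits[:n]  # "1", "12", ..., "123456789012"
    if unanchored.search(candidate):
        unanchored_lengths.append(n)
    if anchored.search(candidate):
        anchored_lengths.append(n)

print("unanchored matches lengths:", unanchored_lengths)
print("anchored matches lengths:", anchored_lengths)
```

The anchored version is a whole-string length check, so it is useful for validation but does not split a 32-character string into pieces.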
