simplifying regular expression

2019-06-21 05:31发布

问题:

I am going through exercises regarding regular expressions, and I am really unsure on how to do this.

The regular expression is:

((a*)(b*))* ∪ (a*)

I am really bad at this, but I think that ((a*)(b*))* can be simplified to (a ∪ b)* But if this is right, than the last ∪ (a*) is really just a repetition, so I figure the whole expression can be simplified to (a ∪ b)*. Does this seem correct?

Edit: ∪ stands for union

回答1:

You are right. (a*b*)* can match any string of a's and b's, so can (a U b)*, therefore they are equivalent. (a U b)* intersect a* is a* so a* is a subset of (a U b)*. Consequently, the whole expression can be simplified to (a U b)*.



回答2:

What ((a*)(b*))*U(a*) really means is (copied from here)

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  (                        group and capture to \1 (0 or more times
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    (                        group and capture to \2:
--------------------------------------------------------------------------------
      a*                       'a' (0 or more times (matching the
                               most amount possible))
--------------------------------------------------------------------------------
    )                        end of \2
--------------------------------------------------------------------------------
    (                        group and capture to \3:
--------------------------------------------------------------------------------
      b*                       'b' (0 or more times (matching the
                               most amount possible))
--------------------------------------------------------------------------------
    )                        end of \3
--------------------------------------------------------------------------------
  )*                       end of \1 (NOTE: because you are using a
                           quantifier on this capture, only the LAST
                           repetition of the captured pattern will be
                           stored in \1)
--------------------------------------------------------------------------------
  U                        'U'
--------------------------------------------------------------------------------
  (                        group and capture to \4:
--------------------------------------------------------------------------------
    a*                       'a' (0 or more times (matching the most
                             amount possible))
--------------------------------------------------------------------------------
  )                        end of \4

This expression currently matches all of these sequences: abUa bU U aabbUaa aaUaa aaU Uaa bbU ababUaa aabbaabbUaa (look at here)

There is no way to simplify this, without removing capturing groups and remaining order of letters.

EDIT: If U in your regex statement stands for "union", then this expression is invalid. There is no way to union anything in regex. There is only OR and you need to use | (pipe) for that. If you want to union ((a*)(b*))* and (a*) then probably it will be ((a*)(b*))*, but it will still match anything like abaab.

Still, capturing groups in your regex statement are useless, so something like [ab]* is enough to match any number of a's and b's.