What would be a regex to replace/remove END where

Posted 2019-02-18 06:18

What would be a regex (PHP) to replace/remove (using preg_replace()) an END that is not preceded by an unclosed START?

Here are a few examples to portray what I mean better:

Example 1:

Input:

sometext....END

Output:

sometext.... //because there's no START, so the excess END is not needed

Example 2:

Input:

STARTsometext....END

Output:

STARTsometext....END //because it's preceded by a START

Example 3:

Input:

STARTsometext....END.......END

Output:

STARTsometext....END....... //because the second END is not preceded by an unclosed START

Hoping someone can help?

Thank You.

3 answers
劳资没心,怎么记你
#2 · 2019-02-18 06:28

It is not possible to write a regular expression for every possible syntax. For your case you might need a context-free parser, such as a bottom-up or top-down one. See: http://en.wikipedia.org/wiki/Formal_grammar

一夜七次
#3 · 2019-02-18 06:40

This is a textbook example of a non-regular language (START and END are the equivalent of opening and closing parentheses). That means you cannot match this language with a simple regular expression. You can do it to some specific depth with a complicated regex, but not arbitrary depth.

You need to write a language parser.
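For the flat START/END tokens in the question, such a parser can be very small. The following is only a minimal sketch, assuming the tokens are the literal strings START and END and that nested pairs should be balanced; the function name removeUnmatchedEnds is hypothetical and not from the original answers:

```php
<?php
// Hypothetical helper (not part of the original thread): drops every END
// that has no open START before it, tracking nesting with a counter.
function removeUnmatchedEnds(string $input): string
{
    // Split on the two tokens, keeping the tokens themselves in the result.
    $parts = preg_split('/(START|END)/', $input, -1, PREG_SPLIT_DELIM_CAPTURE);

    $depth = 0; // number of STARTs still waiting for their END
    $out = '';

    foreach ($parts as $part) {
        if ($part === 'START') {
            $depth++;
            $out .= $part;
        } elseif ($part === 'END') {
            if ($depth > 0) {
                // This END closes an open START, so keep it.
                $depth--;
                $out .= $part;
            }
            // Otherwise the END is unmatched and is dropped.
        } else {
            // Ordinary text between tokens is copied unchanged.
            $out .= $part;
        }
    }

    return $out;
}
```

Called on the third example from the question, removeUnmatchedEnds('STARTsometext....END.......END') keeps the first END and drops the second. Because it counts depth, it also handles nested pairs, which a single non-recursive pattern cannot do for arbitrary depth.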

Related reading:

http://www.amazon.com/Introduction-Automata-Theory-Languages-Computation/dp/0321462254/ref=sr_1_1?ie=UTF8&qid=1291768284&sr=8-1

#4 · 2019-02-18 06:41

Assuming you aren't looking for nested pairs, there is a simple solution to remove excess ENDs. Consider:

$str = preg_replace('/END|(START.*?END)/', '$1', $str);

This replacement looks a little backwards, but it makes sense once you understand the order in which the engine works. The regex is made of two main parts: END|(). The alternatives are tried from left to right, so if the engine sees an END at the current position in the input string, it matches it and moves on to the next match (that is, it looks for END again).
The second part is a capturing group containing START.*?END, which matches an entire START/END pair if possible. Everything else is skipped until the engine finds another END or START.

Since we use $1 in the replacement, which is the captured group, we keep only what the second alternative matched. Therefore, the only way for an END to survive is to get into the capturing group, by being the first one after a START.
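As a quick check, the pattern can be run against the three inputs from the question. This is a small sketch; the variable names are only illustrative:

```php
<?php
// The pattern from the answer; single quotes keep $1 literal in PHP.
$pattern = '/END|(START.*?END)/';

// Inputs taken from the question's three examples.
$inputs = [
    'sometext....END',
    'STARTsometext....END',
    'STARTsometext....END.......END',
];

$results = [];
foreach ($inputs as $input) {
    // A bare END leaves group 1 empty, so it is replaced with nothing;
    // a START...END pair is replaced with itself.
    $result = preg_replace($pattern, '$1', $input);
    $results[] = $result;
    echo $result, "\n";
}
```

Note that . does not match newlines by default, so a START and its END on different lines would need the s modifier for the pair to be captured.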

For example, take the text END START 123 END abc END. The regex finds the following matches, with the characters in between skipped, and each is kept, skipped or removed accordingly:

  • END - Removed
  • (START 123 END) - Captured
  • a - Skip
  • b - Skip
  • c - Skip
  • END - Removed
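The walkthrough above can be reproduced with preg_match_all(), which lists the matches without replacing anything. This is a sketch for inspection only; the skipped characters do not appear because they are never part of a match:

```php
<?php
$pattern = '/END|(START.*?END)/';
$text = 'END START 123 END abc END';

// PREG_SET_ORDER gives one array per match: [0] is the whole match,
// [1] is the capturing group when the second alternative matched.
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

$trace = [];
foreach ($matches as $m) {
    // Group 1 is missing or empty when a bare END matched.
    $kept = ($m[1] ?? '') !== '';
    $status = $kept ? 'captured' : 'removed';
    $trace[] = [$m[0], $status];
    echo $m[0], ' - ', $status, "\n";
}

// The actual replacement: only captured pairs and skipped text remain.
$cleaned = preg_replace($pattern, '$1', $text);
```

The surrounding spaces and abc survive in $cleaned because preg_replace() leaves unmatched text untouched.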

Working example: http://ideone.com/suVYh
