What would be a regex (PHP) to replace/remove (using preg_replace()) an END that has not been preceded by an unended START?
Here are a few examples to portray what I mean better:
Example 1:
Input:
sometext....END
Output:
sometext.... //because there's no START, so the excess END is not needed
Example 2:
Input:
STARTsometext....END
Output:
STARTsometext....END //because it's preceded by a START
Example 3:
Input:
STARTsometext....END.......END
Output:
STARTsometext....END....... //because the last END is not preceded by an unended START
Hoping someone can help?
Thank You.
It is not possible to write a regular expression for every possible syntax. For your case you might need a context-free parser, such as a bottom-up (ascending) or top-down (descending) one. See: http://en.wikipedia.org/wiki/Formal_grammar
This is a textbook example of a non-regular language (START and END are the equivalent of opening and closing parentheses). That means you cannot match this language with a simple regular expression. You can do it to some specific depth with a complicated regex, but not arbitrary depth.
You need to write a language parser.
Related reading:
http://www.amazon.com/Introduction-Automata-Theory-Languages-Computation/dp/0321462254/ref=sr_1_1?ie=UTF8&qid=1291768284&sr=8-1
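To illustrate the kind of hand-written scanning this answer points to, here is a minimal sketch that tracks nesting depth with a counter. It assumes START/END pairs may nest and that any END seen while no START is open should be dropped. The function name stripUnmatchedEnd() is hypothetical, and this is a token scanner rather than a full parser.

```php
<?php
// stripUnmatchedEnd() is a hypothetical name used only for this sketch.
function stripUnmatchedEnd(string $text): string
{
    // Split on the two tokens while keeping the tokens themselves in the result.
    $parts = preg_split('/(START|END)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $depth = 0;  // number of STARTs still waiting for their END
    $out = '';

    foreach ($parts as $part) {
        if ($part === 'START') {
            $depth++;
        } elseif ($part === 'END') {
            if ($depth === 0) {
                continue; // no open START: drop this END
            }
            $depth--;
        }
        $out .= $part;
    }

    return $out;
}

echo stripUnmatchedEnd('START a START b END c END d END'), "\n";
```

Unlike a fixed-depth regex, the counter handles any nesting depth, at the cost of a few more lines than a single preg_replace() call.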
Assuming you aren't looking for nested pairs, there is a simple solution to remove excess ENDs. Consider:
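A minimal sketch of that replacement, assuming the pattern and replacement string described in the explanation that follows (END|(START.*?END) replaced with $1). The variable names and the sample string are illustrative. Note that . does not match newlines unless the s modifier is added.

```php
<?php
// $input and $output are illustrative names, not taken from the answer.
$input = 'END START 123 END abc END';

// Alternative 1 matches a bare END; group 1 is empty there, so the END is replaced by nothing.
// Alternative 2 captures a whole START...END span, which $1 puts back unchanged.
$output = preg_replace('/END|(START.*?END)/', '$1', $input);

echo $output, "\n";
```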
This replacement is a little backwards, but it makes sense once you understand the order in which the engine works. First, the regex is made of two main parts: END|(). The alternatives are tried from left to right, so if the engine sees an END in the input string, it will match it and move on to the next match (that is, look for END again).
The second part is a capturing group, which contains START.*?END - this will match an entire START/END token if possible. Everything else will be skipped, until it finds another END or START.
Since we use $1 in the replacement, which is the captured group, we only save the second token. Therefore, the only way for an END to survive is to get into the capturing group, by being the first one after a START.
For example, for the text END START 123 END abc END, the regex will find the following matches, and keep, skip or remove them accordingly:
END - Removed
(START 123 END) - Captured
a - Skip
b - Skip
c - Skip
END - Removed
Working example: http://ideone.com/suVYh
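The same replacement can be checked against the three examples from the question. This is a small sketch: removeExcessEnd() is a hypothetical helper name, and it assumes no nested START/END pairs, as stated at the top of this answer.

```php
<?php
// removeExcessEnd() is a hypothetical helper name used only for this sketch.
function removeExcessEnd(string $text): string
{
    // Fall back to the original text if preg_replace() reports an error (returns null).
    return preg_replace('/END|(START.*?END)/', '$1', $text) ?? $text;
}

$examples = [
    'sometext....END',
    'STARTsometext....END',
    'STARTsometext....END.......END',
];

foreach ($examples as $example) {
    echo removeExcessEnd($example), "\n";
}
```

The three lines printed should match the Output lines given in the question for Examples 1 to 3.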