I need to remove all occurrences of a BBCode-style tag from a string. The tags can be nested, and this is where I am failing. I also need to relocate each tag and its contents to the end of the string, and replace the tag with an HTML element. I have tried regex and preg_replace_callback, but so far without success. I also tried to adapt the answers to the following questions, again with no luck:
Removing nested bbcode (quotes) in PHP
and
How can I remove an html element and it's contents using RegEx
I don't think I can use an HTML parser for this because the HTML is malformed (children in elements that can't have children).
Here is what the string looks like:
This is some
[tag] attribute=1 attribute2=1
[tag] attribute=1 attribute2=1 [/tag]
[tag] attribute=1 attribute2=1 [/tag]
[/tag]
text.
The result should look like this:
This is some text.
<br attribute=1 attribute2=1>
<br attribute=1 attribute2=1>
<br attribute=1 attribute2=1>
Any help would be appreciated.
Street cred: I worked for Infopop (later known as Groupee, now Social Strata), the creators of UBBCode, the thing that was copied and transformed into just plain old regular "BBCode."
tl;dr: Time to write your own non-regex parser.
Most BBCode parsers use regexes, and that works for most cases, but you're doing something custom here. Plain old regular expressions are not going to help you. Regexes have two modes of operation that get in our way: we can either match everything between two tags in "greedy" mode, or in "not greedy" mode.
In "greedy" mode, we'll capture everything between the very first opening tag and the very last closing tag. This breaks things horribly. Take this case:
[a][b][c]...[/c][/b][/a]...[a]...[/a]
A greedy regex like \[a\].+\[/a\] is going to grab everything from that first opening tag to that last closing tag, ignoring the fact that the closer isn't closing the opener.
The other option is worse. Take this case:
[a][b][a]...[/a][/b][/a]
An ungreedy regex like \[a\].+?\[/a\] (the only change is the question mark) is going to match the first opening tag, but then it'll match the first closing tag, again ignoring that the closing tag doesn't belong to the opening tag.
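To make the two failure modes concrete, here is a small sketch that runs both patterns against the two sample strings above with preg_match. The variable names are only illustrative.

```php
<?php
// Greedy: .+ runs on to the LAST [/a], swallowing the unrelated second [a] block.
$greedyInput = '[a][b][c]...[/c][/b][/a]...[a]...[/a]';
preg_match('~\[a\].+\[/a\]~', $greedyInput, $greedy);
// $greedy[0] is the entire input string.

// Ungreedy: .+? stops at the FIRST [/a], which belongs to the inner [a].
$lazyInput = '[a][b][a]...[/a][/b][/a]';
preg_match('~\[a\].+?\[/a\]~', $lazyInput, $lazy);
// $lazy[0] is '[a][b][a]...[/a]', leaving '[/b][/a]' dangling.
```

Neither match lines up with a correctly paired opener and closer, which is the whole problem.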
Way, way back in the primitive days, I solved this by completely ignoring the fact that the opening and closing tags didn't match. I simply looped the entire chain of tag transformation regexes until the output stopped changing. It was simple and effective, mainly because the available tag set was intentionally limited, so nesting was never an issue.
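A rough reconstruction of that loop-until-stable idea, not the original UBBCode source. The tag map and the function name convertSimpleTags are made up for illustration.

```php
<?php
// Hypothetical tag map: BBCode tag name => HTML element name.
const SIMPLE_TAGS = ['b' => 'strong', 'i' => 'em'];

function convertSimpleTags(string $text): string
{
    do {
        $before = $text;
        foreach (SIMPLE_TAGS as $bb => $html) {
            // Ungreedy match; no attempt to pair opener and closer correctly.
            $text = preg_replace(
                '~\[' . $bb . '\](.*?)\[/' . $bb . '\]~s',
                '<' . $html . '>$1</' . $html . '>',
                $text
            );
        }
    } while ($text !== $before); // stop once a full pass changes nothing
    return $text;
}

$html = convertSimpleTags('[b]bold [i]both[/i][/b]');
// $html is '<strong>bold <em>both</em></strong>'
```

This only works because each tag maps one-to-one onto an HTML element. It cannot move content around or treat an inner tag differently from an outer one.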
The instant you allow nesting of identical tags, blind, brute force is no longer a suitable tool.
If none of the BBCode parsing engines out there are going to work for you, you might have to write your own. Check all of them out first. There are some on PEAR, there's a PECL extension, etc. Also check other languages for inspiration: Perl's CPAN has a dozen different implementations, some of which are very powerful and complex (if there isn't a proper recursive descent parser in that mix, I'll be depressed). Writing your own is a good challenge, but it's not too hard. Then again, I've written like five now (none of which I can release), so maybe I'm biased?
Start by exploding the string on [ and ]. Go through the resulting array, keeping track of when the element following an opening bracket and preceding the next closing bracket happens to look like a valid tag and/or attributes. You're going to need to think about what happens when an attribute can contain a bracket or, worse, is a bracket-heavy URL (like PHP array syntax). You'll also need to think about attributes in general, including how (if?) they are quoted, whether multiple attributes per tag are allowed (as in your example), and what to do with invalid attributes.
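Here is one way that first pass might look, as a sketch only. PHP's explode() takes a single delimiter, so this uses preg_split() to split on both brackets and keep them in the result. The tag grammar (letters only, no attributes inside the brackets), the token array shape, and the function name tokenize are all assumptions made for the example.

```php
<?php
function tokenize(string $input): array
{
    // Split on either bracket and keep the brackets as array elements.
    $parts = preg_split('~([\[\]])~', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
    $tokens = [];
    $text = '';
    $count = count($parts);
    for ($i = 0; $i < $count; $i++) {
        // A tag is "[", something that looks like a tag name, then "]".
        $isTag = $parts[$i] === '['
            && isset($parts[$i + 2])
            && $parts[$i + 2] === ']'
            && preg_match('~^(/?)([a-z]+)$~i', $parts[$i + 1], $m);
        if ($isTag) {
            if ($text !== '') {
                $tokens[] = ['type' => 'text', 'value' => $text];
                $text = '';
            }
            $tokens[] = [
                'type' => $m[1] === '/' ? 'close' : 'open',
                'name' => strtolower($m[2]),
            ];
            $i += 2; // skip the tag name and the closing bracket
        } else {
            $text .= $parts[$i]; // not a tag: keep it as literal text
        }
    }
    if ($text !== '') {
        $tokens[] = ['type' => 'text', 'value' => $text];
    }
    return $tokens;
}
```

Anything in brackets that does not look like a tag, such as "[not a tag]", falls through as plain text instead of being swallowed.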
As you continue to process the string, you will also need to keep track of what tags are open, and in what order. You'll have to think about what tags are permitted inside other tags. You'll also have to deal with mis-nesting, like [a][b][/a][/b]. Your options will be either re-opening the inner tag after the outer closes, or closing the inner as soon as the outer does. Worse, different behavior might make sense depending on the situation. Worse-worse are wacky tags like [*] inside [list], which traditionally doesn't have a closing tag!
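A minimal sketch of the open-tag bookkeeping, using a stack and the "close the inner as soon as the outer does" policy. The token shape (arrays with 'type' and 'name' or 'value' keys) and the function name rebalance are assumptions for the example. Permitted-children rules and closer-less tags like [*] are left out.

```php
<?php
function rebalance(array $tokens): array
{
    $stack = []; // names of currently open tags, innermost last
    $out = [];
    foreach ($tokens as $token) {
        if ($token['type'] === 'open') {
            $stack[] = $token['name'];
            $out[] = $token;
        } elseif ($token['type'] === 'close') {
            if (!in_array($token['name'], $stack, true)) {
                continue; // stray closer with no matching opener: drop it
            }
            // Close any inner tags first, then the one that was asked for.
            do {
                $name = array_pop($stack);
                $out[] = ['type' => 'close', 'name' => $name];
            } while ($name !== $token['name']);
        } else {
            $out[] = $token; // plain text passes through untouched
        }
    }
    // Close whatever is still open at the end of the input.
    while ($stack) {
        $out[] = ['type' => 'close', 'name' => array_pop($stack)];
    }
    return $out;
}
```

Fed the tokens for [a][b][/a][/b], this yields the tokens for [a][b][/b][/a]: the inner [b] is closed when [a] closes, and the now-orphaned [/b] is dropped. The re-opening policy would instead push [b] back onto the stack after [a] closes.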
Once you've processed the string and have created a list of open and closing tags (and possibly re-balanced the opens and closes), then you can transform the result into HTML, or whatever your output ends up being. This is when and how you'd move the output of those specific tags to the end of the new document.
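For the specific string in the question, the relocation step could be sketched like this. It makes several assumptions that are not in the original: the tag is literally named "tag", the attributes are the unquoted name=value text right after each opener, whitespace in the remaining text may be collapsed, and relocateTags is a made-up function name. It uses preg_split only to tokenize, and tracks nesting with a depth counter.

```php
<?php
function relocateTags(string $input): string
{
    // Keep the tags themselves, and any empty strings between adjacent tags.
    $parts = preg_split('~(\[/?tag\])~', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
    $depth = 0;         // how many [tag] openers are currently unclosed
    $afterOpen = false; // true while the next text chunk holds attributes
    $body = '';
    $moved = [];
    foreach ($parts as $part) {
        if ($part === '[tag]') {
            $depth++;
            $afterOpen = true;
        } elseif ($part === '[/tag]') {
            $depth = max(0, $depth - 1);
            $afterOpen = false;
        } elseif ($depth === 0) {
            $body .= $part; // text outside every tag stays in place
        } elseif ($afterOpen) {
            // Whitelist simple name=value pairs instead of copying raw input.
            preg_match_all('~[a-z][a-z0-9]*=[a-z0-9]+~i', $part, $m);
            $attrs = implode(' ', $m[0]);
            $moved[] = $attrs === '' ? '<br>' : '<br ' . $attrs . '>';
            $afterOpen = false;
        }
    }
    $body = trim(preg_replace('~\s+~', ' ', $body));
    return implode("\n", array_merge([$body], $moved));
}
```

On the sample input this produces "This is some text." followed by three `<br attribute=1 attribute2=1>` lines. Quoted attribute values, other tag names, and mis-nested input all need more work than this.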
Once you've finished up, write a thousand test cases. Try to break it, blow it into itty bitty chunks, produce XSS vulnerabilities, and otherwise do your best to make your life hell. It will be worth it, because the result will be a BBCode engine that will do what you're trying to do.