I'm working on an application that has a workflow for postal mail. The letters are generated according to my application's business rules.
Templates are in HTML or RTF, and everything works perfectly as long as the user does not create the RTF with Word. Word support is not in the specs, but my management would welcome Word compatibility if it doesn't involve too much work, and it would make life easier for our customer.
The RTF templates contain tags which are replaced by application values. In most RTF files the tags are not split, so the search and replace works perfectly. I would like to handle Word output with as few modifications as possible.
Example data: [[FooBuzz]]. In most RTF files it is not split.
In Word 2003:
{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[}{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid2708730 FooBuzz}{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 ]]}
Their version of Word (Word 2007) also splits the tag name itself: Foo{garbage inside} Buzz.
So I would like to keep handling ordinary RTF perfectly, and also detect tags even when they are split.
I have two constraints: first, no regression; second, it has to stay simple. Performance is not an issue here.
I'm using symfony 1.4. The relevant part of the current search code:
$regExpression = '/\[\[([^\[\]]*)\]\]/';
preg_match_all($regExpression, $sTemplate, $outKeys);
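To make the problem reproducible, here is a minimal sketch that runs the current regex against a plain tag and against the Word 2003 sample. The variable names `$plain`, `$word`, `$plainKeys` and `$wordKeys` are only for illustration. It suggests the regex does still match the Word output, but the captured key contains RTF markup instead of the bare tag name.

```php
<?php
// Current tag-detection regex from the application.
$regExpression = '/\[\[([^\[\]]*)\]\]/';

// Plain RTF: the tag is contiguous.
$plain = 'Dear [[FooBuzz]],';

// Word 2003 output: the brackets and the name sit in separate RTF groups.
$word = '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[}'
      . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid2708730 FooBuzz}'
      . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 ]]}';

preg_match_all($regExpression, $plain, $plainKeys);
preg_match_all($regExpression, $word, $wordKeys);

// $plainKeys[1] is array('FooBuzz').
// $wordKeys[1][0] starts with '}{\rtlch' and only contains FooBuzz
// somewhere in the middle, because [^\[\]] also accepts braces and
// backslashes.
var_dump($plainKeys[1], $wordKeys[1]);
```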
Update:
I guess I mostly need to perfect this regex. I'm working on some regexes, but they still need improvement:
/([\a-zA-Z0-9]+)/
produces:
[0] => Array
(
[0] => \rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[
[1] => \rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid2708730 FooBuzz
[2] => \rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 ]]
)
Update 2:
I still have a few problems with the regexes. The first one currently finds both tag values and plain text. I'm not sure that what I want is even possible in a reasonable amount of time.
I need to modify the regex so that it catches the same results, but only inside [[ ]]; currently it works on plain text too.
Even harder, I have to be able to catch all of my sample data (but not plain text), by whatever means necessary.
For my replace regex, which replaces my tag and all the garbage around it, I have almost succeeded:
/{.*?\[\[.*(?<!\\)\w+\b.*\]\].*?}/
But it is too greedy. I want to match the group { [[}{tag}{ ]]}, and it matches {plain text}{ [[}{tag}{ ]]}{plain text}.
I added the ? because I read that it makes .* non-greedy, but it doesn't work. Any ideas?
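The over-matching can be reproduced with a much smaller pattern. A lazy quantifier only picks the shortest continuation from a given starting point; the regex engine still starts at the leftmost `{` that can lead to a match. The sketch below uses a shortened, made-up sample and two illustrative patterns (neither is the application's regex): the lazy one still begins at the first group, while a character class that excludes braces cannot cross a group boundary.

```php
<?php
// Shortened sample: one unrelated group, then the group holding "[[".
$sample = '{\toto toto}{\rtlch \insrsid5517131 [[}';

// Lazy version: the match still starts at the first "{" in the input,
// and .*? simply grows until "[[" is reached.
preg_match('/\{.*?\[\[/', $sample, $lazy);
// $lazy[0] is '{\toto toto}{\rtlch \insrsid5517131 [['

// Bounded version: [^{}] cannot step over a brace, so the attempt at the
// first "{" fails and the match starts at the second group.
preg_match('/\{[^{}]*\[\[/', $sample, $bounded);
// $bounded[0] is '{\rtlch \insrsid5517131 [['

var_dump($lazy[0], $bounded[0]);
```

This assumes the groups around a tag are not nested; nested groups would need a different approach.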
I can't figure out what's wrong with this regex (the one that finds the tag name):
\[\[(\b(?<!\\)\w+\b)\]\]
As I understand it, it says: inside [[ ]], find any word that does not start with a backslash, followed by any word characters. Am I right?
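A quick check of that reading, as a sketch with illustrative variable names. The pattern does not search inside [[ ]]: the word has to start immediately after [[ and end immediately before ]], so the lookbehind only ever inspects the second [. One more thing to watch, if the pattern is written in a single-quoted PHP string: `\\` collapses to a single backslash before PCRE sees it, so a literal backslash in the regex needs four.

```php
<?php
// Same pattern, with the backslash in the lookbehind doubled for PHP.
$nameRegex = '/\[\[(\b(?<!\\\\)\w+\b)\]\]/';

$plain = 'Dear [[FooBuzz]],';
$word = '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[}'
      . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid2708730 FooBuzz}'
      . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 ]]}';

// Contiguous tag: matches, and group 1 holds the bare name.
$plainHit = preg_match($nameRegex, $plain, $m);
// $plainHit is 1 and $m[1] is 'FooBuzz'

// Word sample: "[[" is followed by "}", not by a word character,
// so there is no match at all.
$wordHit = preg_match($nameRegex, $word);
// $wordHit is 0

var_dump($plainHit, $m[1], $wordHit);
```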
Update 3:
Sorry, I was unclear.
My first regex aims to catch FooBuzz in [[FooBuzz]], and the second to catch [[FooBuzz]] itself. So with the first regex I want to catch only the text FooBuzz and ignore everything else (such as {} and control words like \eoeoe).
With the second, I have to replace [[FooBuzz]] completely, so I have to catch {[[}{FooBuzz}{]]} and nothing more.
Currently I'm catching {plain text I mustn't catch}{[[}{FooBuzz}{]]}. I catch too much here: I'm catching "plain text I mustn't catch [[FooBuzz]]".
For the [[ part, I need to catch only this: {\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[}. I guess it's because the engine can't find a non-greedy match, so it falls back to greedy mode. It fails with this data sample:
{\toto toto}{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[}{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid2708730 FooBuzz}{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 ]]}{\toto toto}
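For this sample, one possible direction is to describe each of the three groups explicitly instead of using `.*`. The following is only a sketch under assumptions: `$cw` is my own approximation of an RTF control word (backslash, lowercase letters, optional number, optional space), `$splitTag` and `$values` are hypothetical names, and groups are assumed not to be nested. It does not cover the Word 2007 case where the name itself is split (Foo{garbage inside} Buzz), nor contiguous tags, which the existing regex would still have to handle.

```php
<?php
// Assumption: zero or more RTF control words such as "\rtlch" or "\af0 ".
$cw = '(?:\\\\[a-z]+-?\d* ?)*';

// Three consecutive groups: {...[[}{...name}{...]]}.
$splitTag = '/\{' . $cw . '\[\[\}\{' . $cw . '(\w+)\}\{' . $cw . '\]\]\}/';

$sample = '{\toto toto}'
    . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 [[}'
    . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid2708730 FooBuzz}'
    . '{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid5517131 ]]}'
    . '{\toto toto}';

// Detection: group 1 holds the bare tag name.
preg_match_all($splitTag, $sample, $keys);
// $keys[1] is array('FooBuzz')

// Replacement: swap the three groups for the application value and
// leave unknown tags untouched.
$values = array('FooBuzz' => 'Hello');
$result = preg_replace_callback($splitTag, function ($m) use ($values) {
    return isset($values[$m[1]]) ? $values[$m[1]] : $m[0];
}, $sample);
// $result is '{\toto toto}Hello{\toto toto}'

var_dump($keys[1], $result);
```

Note that the replacement drops the formatting control words of the three matched groups, which may or may not be acceptable for the generated letters.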