mIRC control codes to HTML, through PHP

I realize this has been asked before, on this very forum no less, but the proposed solution was not reliable for me.

I have been working on this for a week or more by now, and I stayed up 'till 3am yesterday working on it... But I digress, let me get to the issue at hand:

For those unaware, mIRC uses ASCII control codes to control character color, underline, weight, and italics. The ASCII codes (in hexadecimal) are 03 for color, 02 for bold, 1F for underline, 1D for italic, and 16 for reverse (white text on a black background).
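For reference, these code points can be written in PHP as escape sequences inside double-quoted strings. The following is only a minimal sketch: the names `MIRC_CODES` and `containsMircCodes` are made up for illustration, and the check is just a quick way to see whether any of the listed control bytes is still left in a string after conversion.

```php
<?php
// Hypothetical lookup table: control byte => meaning (values from the list above).
const MIRC_CODES = [
    "\x03" => 'color',
    "\x02" => 'bold',
    "\x1F" => 'underline',
    "\x1D" => 'italic',
    "\x16" => 'reverse',
];

// Returns true if any of the listed control bytes is still present in $text.
function containsMircCodes(string $text): bool
{
    return strpbrk($text, implode('', array_keys(MIRC_CODES))) !== false;
}

var_dump(containsMircCodes("\x02bold text\x02")); // bool(true)
var_dump(containsMircCodes("plain text"));        // bool(false)
```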

As an example of the form this data will arrive in, we have (in regex escape notation, because those characters will not print):

\x034this text is red\x033this text is green\x03 \x02bold text\x02
\x034,3this text is red with a green background\x03

Et cetera.

Below are the two functions I have attempted to modify for my own use, but they have returned unreliable results. To be specific about 'unreliable' before I get into the code: sometimes the text would parse correctly, and other times control codes would still be left in the text, and I can't figure out why. Anyway:

function mirc2html($x) {
    $c = array("FFF","000","00007F","009000","FF0000","7F0000","9F009F","FF7F00","FFFF00","00F800","00908F","00FFFF","0000FF","FF00FF","7F7F7F","CFD0CF");
    $x = preg_replace("/\x02(.*?)((?=\x02)\x02|$)/", "<b>$1</b>", $x);
    $x = preg_replace("/\x1F(.*?)((?=\x1F)\x1F|$)/", "<u>$1</u>", $x);
    $x = preg_replace("/\x1D(.*?)((?=\x1D)\x1D|$)/", "<i>$1</i>", $x);
    $x = preg_replace("/\x03(\d\d?),(\d\d?)(.*?)(?(?=\x03)|$)/e", "'</span><span style=\"color: #'.\$c[$1].'; background-color: #'.\$c[$2].';\">$3</span>'", $x);
    $x = preg_replace("/\x03(\d\d?)(.*?)(?(?=\x03)|$)/e", "'</span><span style=\"color: #'.\$c[$1].';\">$2</span>'", $x);
    //$x = preg_replace("/(\x0F|\x03)(.*?)/", "<span style=\"color: #000; background-color: #FFF;\">$2</span>", $x);
    //$x = preg_replace("/\x16(.*?)/", "<span style=\"color: #FFF; background-color: #000;\">$1</span>", $x);
    //$x = preg_replace("/\<\/span\>/","",$x,1);
    //$x = preg_replace("/(\<\/span\>){2}/","</span>",$x);
    return $x;
}

function color_rep($matches) {
    $matches[2] = ltrim($matches[2], "0");
    $bindings = array(0=>'white',1=>'black',2=>'blue',3=>'green',4=>'red',5=>'brown',6=>'purple',7=>'orange',8=>'yellow',9=>'lightgreen',10=>'#00908F',
        11=>'lightblue',12=>'blue',13=>'pink',14=>'grey',15=>'lightgrey');
    $preg = preg_match_all('/(\d\d?),(\d\d?)/',$matches[2], $col_arr);
    //print_r($col_arr);
    $fg = isset($bindings[$matches[2]]) ? $bindings[$matches[2]] : 'transparent';
    if ($preg == 1) {
        $fg = $bindings[$col_arr[1][0]];
        $bg = $bindings[$col_arr[2][0]];
    }
    else {
        $bg = 'transparent';
    }


    return '<span style="color: '.$fg.'; background: '.$bg.';">'.$matches[3].'</span>';
}

And, in case it is relevant, where the code is called:

$logln = preg_replace_callback("/(\x03)(\d\d?,\d\d?|\d\d?)(\s?.*?)(?(?=\x03)|$)/","color_rep",$logln);

Sources: First, Second

I have of course also looked at the methods used by various PHP/AJAX-based IRC clients, without any success. As for doing this on the mIRC side, I have looked into that as well; although the results were more reliable than with PHP, the amount of data sent to the server grows so much that the socket times out on upload, so it isn't a viable option.

As always, any help in this matter would be appreciated.

You should divide the problem, for example with a tokenizer. A tokenizer will scan the input string and turn the special parts into named tokens, so the rest of your script can identify them. Usage example:

$mirc = "\x034this text is red\x033this text is green\x03 \x02bold text\x02
\x034,3this text is red with a green background\x03";

$tokenizer = new Tokenizer($mirc);

while(list($token, $data) = $tokenizer->getNext())
{
    switch($token)
    {
        case 'color-fgbg':
            printf('<%s:%d,%d>', $token, $data[1], $data[2]);
            break;

        case 'color-fg':
            printf('<%s:%d>', $token, $data[1]);
            break;

        case 'color-reset':
        case 'style-bold':
            printf('<%s>', $token);
            break;

        case 'catch-all':
            echo $data[0];
            break;

        default:
            throw new Exception(sprintf('Unknown token <%s>.', $token));
    }
}

This does not do much yet beyond identifying the interesting parts and their (sub-)values, as the output demonstrates:

<color-fg:4>this text is red<color-fg:3>this text is green<color-reset> <style-bold>bold text<style-bold>
<color-fgbg:4,3>this text is red with a green background<color-reset>

It should be relatively easy for you to modify the loop above to handle state, such as opening and closing color tags and font-style tags like bold.
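One possible way to handle that state is sketched below. It is not the only approach and makes several assumptions: `mircTokens()` is a compact stand-in written just for this sketch (it is not the Tokenizer class), `mircToHtml()` is a hypothetical name, the palette is copied from the `$c` array in the question, and the syntax needs PHP 7.1 or later. Instead of opening and closing tags as codes arrive, it remembers the current colors and bold flag and wraps each run of text in its own span, which avoids mis-nested tags.

```php
<?php
// Stand-in tokenizer for this sketch: returns [token name, match groups] pairs.
function mircTokens(string $s): array
{
    preg_match_all('~\x03(\d{1,2})(?:,(\d{1,2}))?|\x03|\x02|[^\x02\x03]+~', $s, $m, PREG_SET_ORDER);
    $tokens = [];
    foreach ($m as $g) {
        if ($g[0] === "\x02") {
            $tokens[] = ['style-bold', $g];
        } elseif ($g[0] === "\x03") {
            $tokens[] = ['color-reset', $g];
        } elseif ($g[0][0] === "\x03") {
            $tokens[] = [($g[2] ?? '') !== '' ? 'color-fgbg' : 'color-fg', $g];
        } else {
            $tokens[] = ['catch-all', $g];
        }
    }
    return $tokens;
}

// Tracks the current style and emits one <span> per run of text.
function mircToHtml(string $s): string
{
    // Palette copied from the question's $c array (index = mIRC color number).
    $palette = ["FFF","000","00007F","009000","FF0000","7F0000","9F009F","FF7F00",
                "FFFF00","00F800","00908F","00FFFF","0000FF","FF00FF","7F7F7F","CFD0CF"];
    $fg = $bg = null;
    $bold = false;
    $html = '';
    foreach (mircTokens($s) as [$token, $data]) {
        switch ($token) {
            case 'color-fgbg':
                $bg = $palette[(int) $data[2]] ?? null;
                // fall through: the foreground is set the same way
            case 'color-fg':
                $fg = $palette[(int) $data[1]] ?? null;
                break;
            case 'color-reset':
                $fg = $bg = null;
                break;
            case 'style-bold':
                $bold = !$bold; // the same code switches bold on and off
                break;
            default: // 'catch-all': a run of plain text
                $css = [];
                if ($fg !== null) { $css[] = "color: #$fg;"; }
                if ($bg !== null) { $css[] = "background-color: #$bg;"; }
                if ($bold)        { $css[] = 'font-weight: bold;'; }
                $text = htmlspecialchars($data[0]);
                $html .= $css ? '<span style="' . implode(' ', $css) . '">' . $text . '</span>' : $text;
        }
    }
    return $html;
}

echo mircToHtml("\x034,3red on green\x03 \x02bold\x02"), "\n";
// <span style="color: #FF0000; background-color: #009000;">red on green</span> <span style="font-weight: bold;">bold</span>
```

Because every text run carries its complete style, a color reset in the middle of bold text needs no special handling.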

The tokenizer itself defines a set of tokens, which it tries to find one after the other at a given offset (starting at the beginning of the string). The tokens are defined by regular expressions:

/**
 * regular expression based tokenizer,
 * first token wins.
 */
class Tokenizer
{
    private $subject;
    private $offset = 0;
    private $tokens = array(
        'color-fgbg'  => '\x03(\d{1,2}),(\d{1,2})',
        'color-fg'    => '\x03(\d{1,2})',
        'color-reset' => '\x03',
        'style-bold'  => '\x02',
        'catch-all'   => '.|\n',
    );
    public function __construct($subject)
    {
        $this->subject = (string) $subject;
    }
    ...

As this private array shows, the tokens are simple regular expressions, each named by its array key. That is the name used in the switch statement above.

The getNext() function looks for a token at the current offset and, if one is found, advances the offset and returns the token including all subgroup matches. Because offsets are involved, preg_match returns a more detailed $matches array; it is simplified (offsets removed) because the main routine normally does not need to know about offsets.
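To make the offset handling more concrete, here is a small standalone sketch of what `preg_match` returns with `PREG_OFFSET_CAPTURE` and how the offsets can be stripped again. The subject string and variable names are only illustrative, and `array_column` is just one way to drop the offsets.

```php
<?php
$subject = "ab\x034cd";
$pattern = '~\x03(\d{1,2})~';

// Searching from offset 0 still finds the code, but it starts at byte 2,
// so it is not a token *at* offset 0.
preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, 0);
$startsAtZero = ($matches[0][1] === 0);

// With PREG_OFFSET_CAPTURE every entry is a [text, byte offset] pair.
preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, 2);
$startsAtTwo = ($matches[0][1] === 2);

// Keep only the matched strings, dropping the offsets.
$data = array_column($matches, 0);

var_dump($startsAtZero, $startsAtTwo, $data);
```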

The principle is simple: the first pattern wins. So you need to place the pattern that matches the most (in terms of string length) at the top for this to work. In your case, the longest one is the token for foreground and background color, <color-fgbg>.
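The effect of the ordering can be seen in a small standalone sketch. The helper `firstTokenAt()` is hypothetical and only mimics the first-match-wins rule at the start of a string (using the `A` modifier to anchor the pattern); it is not part of the Tokenizer class.

```php
<?php
// Returns [name, matched text] for the first pattern that matches at the
// start of $subject, or null if none does.
function firstTokenAt(array $tokens, string $subject): ?array
{
    foreach ($tokens as $name => $pattern) {
        // The A modifier anchors the match at the start of the subject.
        if (preg_match("~$pattern~A", $subject, $m) === 1) {
            return [$name, $m[0]];
        }
    }
    return null;
}

$input = "\x034,3text";

$longestFirst = [
    'color-fgbg' => '\x03(\d{1,2}),(\d{1,2})',
    'color-fg'   => '\x03(\d{1,2})',
];
$shortestFirst = array_reverse($longestFirst, true);

[$name, $text] = firstTokenAt($longestFirst, $input);
printf("%s consumed %d bytes\n", $name, strlen($text)); // color-fgbg consumed 4 bytes

[$name, $text] = firstTokenAt($shortestFirst, $input);
printf("%s consumed %d bytes\n", $name, strlen($text)); // color-fg consumed 2 bytes
```

With the shorter pattern first, the remaining ",3" would be treated as ordinary text and end up in the output.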

If no token can be found, NULL is returned. Here is the getNext() function:

...
/**
 * @return array|null
 */
public function getNext()
{
    if ($this->offset >= strlen($this->subject))
        return NULL;

    foreach($this->tokens as $name => $token)
    {
        if (FALSE === $r = preg_match("~$token~", $this->subject, $matches, PREG_OFFSET_CAPTURE, $this->offset))
            throw new RuntimeException(sprintf('Pattern for token %s failed (regex error).', $name));
        if ($r === 0)
            continue;
        // preg_match searches forward from the offset, so only accept a
        // match that starts exactly at the current offset.
        if ($matches[0][1] !== $this->offset)
            continue;
        // Keep only the matched strings; drop the offsets.
        $data = array();
        foreach($matches as $match)
        {
            list($data[]) = $match;
        }

        $this->offset += strlen($data[0]);
        return array($name, $data);
    }
    return NULL;
}
...

So the tokenization of the string is now encapsulated in the Tokenizer class, and the parsing of the tokens is something you can do on your own in some other part of your application. That should make it easier for you to change the way of styling (HTML output, CSS-based HTML output, or something different like BBCode or Markdown) and also to support new codes in the future. And in case something is missing, you can fix things more easily, because it is either an unrecognized code or something missing in the transformation.
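As an illustration of supporting new codes, the sketch below extends the token table with underline, italic, and reverse, using the code points listed in the question. The token names `style-underline`, `style-italic`, and `style-reverse` and the helper `scanTokens()` are assumptions made for this example; `scanTokens()` is a compact stand-in for the Tokenizer class so the snippet runs on its own.

```php
<?php
// Generic first-match-wins scanner; returns [token name, matched text] pairs.
function scanTokens(array $tokens, string $subject): array
{
    $out = [];
    $offset = 0;
    $length = strlen($subject);
    while ($offset < $length) {
        foreach ($tokens as $name => $pattern) {
            // A anchors the match at $offset; s lets "." match newlines too.
            if (preg_match("~$pattern~As", $subject, $m, 0, $offset) === 1) {
                $out[] = [$name, $m[0]];
                $offset += strlen($m[0]);
                continue 2; // restart with the first pattern at the new offset
            }
        }
        break; // no pattern matched; stop rather than loop forever
    }
    return $out;
}

$tokens = [
    'color-fgbg'      => '\x03(\d{1,2}),(\d{1,2})',
    'color-fg'        => '\x03(\d{1,2})',
    'color-reset'     => '\x03',
    'style-bold'      => '\x02',
    // Assumed additions, placed before the catch-all so they can win:
    'style-underline' => '\x1F',
    'style-italic'    => '\x1D',
    'style-reverse'   => '\x16',
    'catch-all'       => '.',
];

$names = array_column(scanTokens($tokens, "\x1Fu\x1F \x1Di\x1D"), 0);
echo implode(' ', $names), "\n";
// style-underline catch-all style-underline catch-all style-italic catch-all style-italic
```

Only the table changes; the scanning loop stays the same, and the consuming switch statement just needs matching cases for the new names.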

The full example as gist: Tokenizer Example of Mirc Color and Style (bold) Codes.
