PHP routing preg_match help needed

Posted 2019-08-26 07:14

I have a custom routing class, which allows me to do matches like this on requests:

'/[*:cat1]/[*:cat2]/?[*:cat3]/?[*:cat4]/?[p:page]/?'

Which will match the following links:

category-one/
category-one/cat-two/
category-one/cat-two/cat-three/
category-one/cat-two/cat-three/cat-four/

As you can see, the ? after / means that the parameter is optional.

My problem is with [p:page]/?, which is also optional, so these links should match too:

category-one/page-2/
category-one/cat-two/page-2/
category-one/cat-two/cat-three/page-2/
category-one/cat-two/cat-three/cat-four/page-2/

The trouble is that when I try to match this link

/category-one/cat-two/page-2/

it will give me these params:

cat1 => category-one
cat2 => cat-two
cat3 => page-2

Instead of

cat1 => category-one
cat2 => cat-two
page => page-2

I am using this generated regexp:

`^(?:/(?P<cat1>[^/\.]+))(?:/(?P<cat2>[^/\.]+/)?)(?:(?P<cat3>[^/\.]+/)?)(?:(?P<cat4>[^/\.]+/)?)(?:(?P<page>(a^)|(?:pag-)(\d+)/)?)$`u

Any help is appreciated. Thanks! Alex

1 answer
Evening l夕情丶
#2 · 2019-08-26 07:46

I would use a token lexer/parser approach. I have a few examples on my GitHub page at:

https://github.com/ArtisticPhoenix/MISC/tree/master/Lexers

These are others I have used to answer questions on SO. One is a parser for JSON objects rather than JSON strings, that is, malformed JSON without the " around the property names, which json_decode can't handle. The other is an HTML minifier (in an OOP style, same concept though) that lets you exclude things like <textarea> tags, because white space matters there. So you can do pretty much any kind of text processing with this method.

I modified one of them. I don't really know how you want the output or what you want to do with it, but it should get you started. You will probably have to integrate it into your URL routing class, and I have no idea what that looks like. But this is a far better method than a simple preg_match, because it gives you a place to perform complex logic on each segment of the match.

 //don't edit this part.
function parse($subject, $tokens)
{
    $types = array_keys($tokens);
    $patterns = [];
    $lexer_stream = [];
    $result = false;
    foreach ($tokens as $k=>$v){
        $patterns[] = "(?P<$k>$v)";
    }
    $pattern = "/".implode('|', $patterns)."/i";
    if (preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE)) {
        //print_r($matches);
        foreach ($matches[0] as $key => $value) {
            $match = [];
            foreach ($types as $type) {
                $match = $matches[$type][$key];
                if (is_array($match) && $match[1] != -1) {
                    break;
                }
            }
            $tok  = [
                'content' => $match[0],
                'type' => $type,
                'offset' => $match[1]
            ];
            $lexer_stream[] = $tok;
        }
        $result = parseTokens( $lexer_stream );
    }
    return $result;
}

//make changes here to how the tokens are dealt with
function parseTokens( array &$lexer_stream ){
    $result = [];

    while($current = current($lexer_stream)){
        $content = $current['content'];
        $type = $current['type'];
        switch($type){  
            case 'T_EOF': return;

            //custom code for you tokens.
            case 'T_DELIMTER': 
            case 'T_BASE': 
                //ignore these
                next($lexer_stream); //don't forget to call next
            break;
            case 'T_CAT':
                $cat = substr($content, 4);
                echo "This is Cat ".$cat."\n";
                next($lexer_stream);
            break;
            case 'T_PAGE':
                $page = substr($content, 5);
                echo "This is Page".$page;
                next($lexer_stream);
            break;

            //catch all token
            case 'T_UNKNOWN':
            default:
                print_r($current);
                trigger_error("Unknown token $type value $content", E_USER_ERROR);
        }
    }
    if( !$current ) return;
    print_r($current);
    trigger_error("Unclosed item $mode for $type value $content", E_USER_ERROR);
}

/**
 * Each token should be "name" => "regex"
 * 
 * Order is important
 * 
 * @var array $tokens
 */
$tokens = [
    'T_EOF'             => '\Z',
    'T_DELIMTER'        => '\/',
    'T_BASE'            => 'category-one',
    'T_CAT'             => 'cat-(?:one|two|three|four)',
    'T_PAGE'            => 'page-\d+',
    'T_UNKNOWN'         => '.+?',
];

$subject = '/category-one/cat-two/page-2/';

parse($subject, $tokens);

echo "\n\n========================================\n\n";

$subject = '/category-one/cat-two/cat-three/cat-four/page-2/';

parse($subject, $tokens);

You can see it in action here

Output of the above code:

//$subject = '/category-one/cat-two/page-2/';
This is Cat two
This is Page2

========================================

//$subject = '/category-one/cat-two/cat-three/cat-four/page-2/';
This is Cat two
This is Cat three
This is Cat four
This is Page2

How it works: this basically uses preg_match_all, wrapped in a convenience function that builds the regular expression and makes the output a bit easier to process. So instead of one monolithic regex, you wind up with several smaller ones that are easier to deal with. It seems complicated at first, but once you understand what it does, it makes things so much easier.

You can even check the order if you want by adding some logic into the parseTokens function. This should be the only place you have to edit stuff and mainly in the token switch statement.
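As a rough sketch of what such an order check could look like: the helper below is hypothetical (isOrdered and streamOf are names I made up, they are not in the code above), and it assumes the order you want is base, then any number of categories, then at most one page. It only looks at the 'type' key of each token, which is the shape parse() builds.

```php
<?php
// Hypothetical helper, not part of the answer's code: returns true when the
// stream reads base -> categories -> page (the order assumed for this sketch).
function isOrdered(array $stream): bool
{
    $rank = ['T_BASE' => 0, 'T_CAT' => 1, 'T_PAGE' => 2];
    $last = -1;
    foreach ($stream as $tok) {
        $type = $tok['type'];
        if ($type === 'T_DELIMTER' || $type === 'T_EOF') {
            continue; // separators and end-of-input carry no ordering
        }
        if (!isset($rank[$type])) {
            return false; // T_UNKNOWN or any token this sketch does not know
        }
        // Categories may repeat; the base and the page may appear only once.
        if ($rank[$type] < $last || ($rank[$type] === $last && $type !== 'T_CAT')) {
            return false;
        }
        $last = $rank[$type];
    }
    return true;
}

// Hypothetical helper: build a stream by hand from a list of token types.
function streamOf(array $types): array
{
    $stream = [];
    foreach ($types as $type) {
        $stream[] = ['type' => $type];
    }
    return $stream;
}

// /category-one/cat-two/page-2/
var_dump(isOrdered(streamOf(['T_DELIMTER', 'T_BASE', 'T_DELIMTER', 'T_CAT', 'T_DELIMTER', 'T_PAGE', 'T_DELIMTER', 'T_EOF']))); // bool(true)
// /category-one/page-2/cat-two/
var_dump(isOrdered(streamOf(['T_DELIMTER', 'T_BASE', 'T_DELIMTER', 'T_PAGE', 'T_DELIMTER', 'T_CAT', 'T_DELIMTER', 'T_EOF']))); // bool(false)
```

You could call something like this at the top of parseTokens and bail out early when it returns false.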

The regex it creates looks like this:

/(?P<T_EOF>\Z)|(?P<T_DELIMTER>\/)|(?P<T_BASE>category-one)|(?P<T_CAT>cat-(?:one|two|three|four))|(?P<T_PAGE>page-\d+)|(?P<T_UNKNOWN>.+?)/i

Because each token is wrapped in its own named group, you can't add sub-capture groups inside a token. Note that when I added the alternation in cat-(?:one|two|three|four), I made it a non-capture group. You can just use substr to separate the pieces later, so it's no big deal.
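If you want to see how that pattern comes together, here is a small standalone sketch of the same loop that parse() runs. It uses the same $tokens array as above; the single preg_match at the end is only there to show that the name of the group that matched gives you the token type.

```php
<?php
$tokens = [
    'T_EOF'      => '\Z',
    'T_DELIMTER' => '\/',
    'T_BASE'     => 'category-one',
    'T_CAT'      => 'cat-(?:one|two|three|four)',
    'T_PAGE'     => 'page-\d+',
    'T_UNKNOWN'  => '.+?',
];

// Wrap each token regex in a named group, then join them with alternation.
$parts = [];
foreach ($tokens as $name => $regex) {
    $parts[] = "(?P<$name>$regex)";
}
$pattern = '/' . implode('|', $parts) . '/i';

echo $pattern, "\n";

// The named group that holds the match identifies the token type.
preg_match($pattern, 'page-2', $m);
echo $m['T_PAGE'], "\n"; // page-2
```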

The \Z is a bit obscure, but it just matches at the end of the string (or right before a final newline) without consuming any characters.
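A quick way to convince yourself of this, as a throwaway sketch: match \Z on its own with PREG_OFFSET_CAPTURE and look at what comes back. The sample string is just an example.

```php
<?php
// \Z on its own: the match is an empty string whose offset is the subject length.
$subject = '/page-2/';
preg_match('/\Z/', $subject, $m, PREG_OFFSET_CAPTURE);
var_dump($m[0][0] === '');               // bool(true), nothing was consumed
var_dump($m[0][1] === strlen($subject)); // bool(true), offset 8

// With a trailing newline, \Z already matches just before that newline.
preg_match('/\Z/', "abc\n", $n, PREG_OFFSET_CAPTURE);
var_dump($n[0][1]); // int(3)
```

That zero-length match is what ends up in the stream as the T_EOF token.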

Also the processing part is called like this (in parse):

$result = parseTokens( $lexer_stream );
...
return $result;

So you can return data that will get passed back through the parse function to where you called it (if you wish):

  $something = parse($subject,$tokens);
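To make that concrete, here is a minimal sketch of a variant that collects values instead of echoing them. lexRoute and collectParams are hypothetical names (a compact restatement of the lexing step plus a returning version of the switch), and the shape of the returned array is just my assumption about what a router might want.

```php
<?php
// Compact restatement of the lexing step in parse(); hypothetical name.
function lexRoute(string $subject, array $tokens): array
{
    $parts = [];
    foreach ($tokens as $name => $regex) {
        $parts[] = "(?P<$name>$regex)";
    }
    $stream = [];
    preg_match_all('/' . implode('|', $parts) . '/i', $subject, $matches, PREG_OFFSET_CAPTURE);
    foreach (array_keys($matches[0]) as $i) {
        foreach (array_keys($tokens) as $type) {
            $m = $matches[$type][$i];
            if (is_array($m) && $m[1] >= 0) { // first group that actually matched
                $stream[] = ['type' => $type, 'content' => $m[0], 'offset' => $m[1]];
                break;
            }
        }
    }
    return $stream;
}

// Hypothetical returning variant of parseTokens(): collects instead of echoing.
function collectParams(array $stream): array
{
    $params = ['cats' => [], 'page' => null];
    foreach ($stream as $tok) {
        switch ($tok['type']) {
            case 'T_CAT':
                $params['cats'][] = substr($tok['content'], 4); // strip "cat-"
                break;
            case 'T_PAGE':
                $params['page'] = (int) substr($tok['content'], 5); // strip "page-"
                break;
            case 'T_UNKNOWN':
                throw new InvalidArgumentException("Unknown token {$tok['content']}");
        }
    }
    return $params;
}

$tokens = [
    'T_EOF'      => '\Z',
    'T_DELIMTER' => '\/',
    'T_BASE'     => 'category-one',
    'T_CAT'      => 'cat-(?:one|two|three|four)',
    'T_PAGE'     => 'page-\d+',
    'T_UNKNOWN'  => '.+?',
];

$params = collectParams(lexRoute('/category-one/cat-two/page-2/', $tokens));
print_r($params); // cats => [two], page => 2
```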

I don't have the time right now to go into the full explanation of what a lexer is or how it all works. So hopefully this is enough to get you started.

UPDATE

"It's a good start, but your code is very specific."

To counter this (don't get me wrong or take this the wrong way), I feel I need to explain it a bit further.

This is very generalized

$tokens = [
    'T_EOF'             => '\Z',
    'T_DELIMTER'        => '\/',
    'T_BASE'            => 'category-one',
    'T_CAT'             => 'cat-(?:one|two|three|four)',
    'T_PAGE'            => 'page-\d+',
    'T_UNKNOWN'         => '.+?',
];

This is very specific

`^(?:/(?P<cat1>[^/\.]+))(?:/(?P<cat2>[^/\.]+/)?)(?:(?P<cat3>[^/\.]+/)?)(?:(?P<cat4>[^/\.]+/)?)(?:(?P<page>(a^)|(?:pag-)(\d+)/)?)$`u

If you have to edit that, it's going to be a huge problem. What if you want to route to books or something else? How are you going to expand on that? I don't even know where to begin.

With the array approach I have given you, you simply add a token:

$tokens = [
    'T_EOF'             => '\Z',
    'T_DELIMTER'        => '\/',
    'T_BASE'            => 'category-one',
    'T_CAT'             => 'cat-(?:one|two|three|four)',
    'T_PAGE'            => 'page-\d+',
    'T_BOOK'            => 'book-\w+',
    'T_UNKNOWN'         => '.+?',
];

Then you modify the switch statement:

  case 'T_BOOK':
       ///do something
  break;
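Put together, a rough end-to-end sketch with the extra token might look like the following. tokenize and routeBook are hypothetical names, the URL /category-one/book-dune/page-3/ is a made-up sample, and what you do inside the T_BOOK case is entirely up to you.

```php
<?php
// Compact version of the lexing step; hypothetical name.
function tokenize(string $subject, array $tokens): array
{
    $parts = [];
    foreach ($tokens as $name => $regex) {
        $parts[] = "(?P<$name>$regex)";
    }
    $stream = [];
    preg_match_all('/' . implode('|', $parts) . '/i', $subject, $matches, PREG_OFFSET_CAPTURE);
    foreach (array_keys($matches[0]) as $i) {
        foreach (array_keys($tokens) as $type) {
            $m = $matches[$type][$i];
            if (is_array($m) && $m[1] >= 0) { // first group that actually matched
                $stream[] = ['type' => $type, 'content' => $m[0]];
                break;
            }
        }
    }
    return $stream;
}

// Hypothetical handler: the new T_BOOK case sits next to the existing ones.
function routeBook(string $subject, array $tokens): array
{
    $route = ['book' => null, 'page' => null];
    foreach (tokenize($subject, $tokens) as $tok) {
        switch ($tok['type']) {
            case 'T_BOOK':
                $route['book'] = substr($tok['content'], 5); // strip "book-"
                break;
            case 'T_PAGE':
                $route['page'] = (int) substr($tok['content'], 5); // strip "page-"
                break;
        }
    }
    return $route;
}

$tokens = [
    'T_EOF'      => '\Z',
    'T_DELIMTER' => '\/',
    'T_BASE'     => 'category-one',
    'T_CAT'      => 'cat-(?:one|two|three|four)',
    'T_PAGE'     => 'page-\d+',
    'T_BOOK'     => 'book-\w+',
    'T_UNKNOWN'  => '.+?',
];

$route = routeBook('/category-one/book-dune/page-3/', $tokens);
print_r($route); // book => dune, page => 3
```

Note that T_BOOK has to come before T_UNKNOWN in the array, since order is important and the catch-all would otherwise win.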

And bam, you can do whatever you want in a clear and concise way. You can add whatever complex logic, error checking, etc. that you need, very easily.
