Simple Wiki Parser And Link Autodetection

Posted 2019-07-28 02:50

Question:

I'm using the following functions:

function MakeLinks($source){
 return preg_replace('!(((f|ht){1}tp://)[-a-zA-Zа-яА-Я()0-9@:%_+.~#?&;//=]+)!i', '<a href="$1">$1</a>', $source);
}

function simpleWiki($text){
 $text = preg_replace('/\[\[Image:(.*)\]\]/', '<a href="$1"><img src="$1" /></a>', $text);
 return $text;
}

The first one converts a plain URL such as http://example.com into a clickable link, <a href="http://example.com">http://example.com</a>.

The second function turns strings like [[Image:http://example.com/logo.png]] into an image.

Now if I have a text

$text = 'this is my image [[Image:http://example.com/logo.png]]';

and convert it with simpleWiki(MakeLinks($text)), it outputs something similar to:

this is my image <a href="url"><img src="<a href="url">url</a>"/></a>

How can I prevent this? How can I check that the URL is not part of a [[Image:URL]] construct?

Answer 1:

Your immediate problem can be solved by combining the two expressions into one pattern with two alternatives, then passing it to the lesser-known but very powerful preg_replace_callback() function, which handles each case separately in a single pass through the target string:

<?php // test.php 20110312_1200
$data = "[[Image:http://example.com/logo1.png]]\n".
        "http://example1.com\n".
        "[[Image:http://example.com/logo2.png]]\n".
        "http://example2.com\n";

$re = '!# Capture WikiImage URLs in $1 and other URLs in $2.
      # Either $1: WikiImage URL
      \[\[Image:(.*?)\]\]
    | # Or $2: Non-WikiImage URL.
      (((f|ht){1}tp://)[-a-zA-Zа-яА-Я()0-9@:%_+.~#?&;//=]+)
      !ixu';

$data = preg_replace_callback($re, '_my_callback', $data);

// The callback function is called once for each
// match found and is passed one parameter: $matches.
function _my_callback($matches)
{ // Either $1 or $2 matched, but never both.
    if ($matches[1] !== '') {  // $1: WikiImage URL (empty when $2 matched)
        return '<a href="'. $matches[1] .
            '"><img src="'. $matches[1] .'" /></a>';
    }
    else {              // $2: Non-WikiImage URL.
        return '<a href="'. $matches[2] .
            '">'. $matches[2] .'</a>';
    }
}
echo($data);
?>

This script implements your two regexes and does what you are asking. Note that I changed the greedy (.*) to the lazy (.*?) because the greedy version does not work correctly: when two or more WikiImages appear on the same line, it matches from the first [[Image: to the last ]] and swallows everything in between. I also added the 'x' modifier, so the pattern can be spread over several lines and commented, and the 'u' modifier, which makes PCRE treat the pattern and subject as UTF-8 (needed here because the character class contains Cyrillic letters). As you can see, the preg callback function is very powerful. (This technique can be used to do some pretty heavy lifting, text-processing-wise.)
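To make the greedy versus lazy difference concrete, here is a minimal sketch. The input line and the variable names are made up for illustration; only the two patterns come from the thread.

```php
<?php
// Two WikiImage tags on the SAME line: this is where greedy and lazy differ.
$line = '[[Image:a.png]] and [[Image:b.png]]';

// Greedy: (.*) runs to the LAST "]]" on the line, so there is a single match.
preg_match_all('/\[\[Image:(.*)\]\]/', $line, $greedy);

// Lazy: (.*?) stops at the FIRST "]]", so each tag is matched on its own.
preg_match_all('/\[\[Image:(.*?)\]\]/', $line, $lazy);

print_r($greedy[1]); // one capture: "a.png]] and [[Image:b.png"
print_r($lazy[1]);   // two captures: "a.png" and "b.png"
```

When every tag sits on its own line (as in the test data above), both versions behave the same, because "." does not match a newline unless the 's' modifier is set.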

However, please note that the regex you are using to pick out URLs can be significantly improved. Check out the following resources for more information on "Linkifying" URLs (Hint: there are a bunch of "gotchas"):
The Problem With URLs
An Improved Liberal, Accurate Regex Pattern for Matching URLs
URL Linkification (HTTP/FTP)
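As one illustration of such a gotcha (my own example, not taken from the resources listed): the character class in the pattern includes ".", "?", ";" and ":", so sentence punctuation directly after a URL ends up inside the link. The sketch below trims it in the callback and puts it back after the anchor. The function name linkify_trim and the set of trimmed characters are assumptions.

```php
<?php
// Hypothetical helper (not from the original answer): linkify URLs but keep
// trailing sentence punctuation outside the generated link.
function linkify_trim($text)
{
    // Same URL character class as in the thread, written as a single group.
    $re = '!(?:f|ht)tp://[-a-zA-Zа-яА-Я()0-9@:%_+.~#?&;/=]+!iu';

    return preg_replace_callback($re, function ($m) {
        // Assumed set of characters that rarely end a real URL.
        $url  = rtrim($m[0], '.;:?');
        // Whatever was trimmed is re-emitted after the closing </a>.
        $tail = substr($m[0], strlen($url));
        return '<a href="' . $url . '">' . $url . '</a>' . $tail;
    }, $text);
}

echo linkify_trim('See http://example.com/page. Thanks!');
```

This handles only one narrow case; it does not address balanced parentheses, HTML escaping, or URLs without a scheme.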



Answer 2:

In your MakeLinks function, add [^:"]{1} at the start of the pattern, as shown below:

function MakeLinks($source){
    return preg_replace('![^:"]{1}(((f|ht){1}tp://)[-a-zA-Zа-яА-Я()0-9@:%_+.~#?&;//=]+)!i', '<a href="$1">$1</a>', $source);
}

Then only links that are not preceded by ":" (as in Image:) or by a double quote will be transformed. Use it as $text = simpleWiki(MakeLinks($text));.

EDIT: Alternatively, you can use this pattern, which only matches URLs that have whitespace on both sides: preg_replace('![[:space:]](((f|ht){1}tp://)[-a-zA-Zа-яА-Я()0-9@:%_+.~#?&;//=]+)[[:space:]]!i', '<a href="$1">$1</a>', $source);
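One thing to watch with both patterns in this answer: they match characters outside the URL (the leading [^:"] or the surrounding whitespace), and because the replacement only re-emits $1, those characters disappear from the output. The first pattern also cannot match a URL at the very start of the string. A possible variant, sketched below, uses a negative lookbehind, which checks the preceding character without consuming it. MakeLinksLookbehind is a hypothetical name, and simpleWiki() is repeated from the question (with a lazy quantifier) only to keep the example self-contained.

```php
<?php
// Hypothetical variant of MakeLinks(): skip URLs directly preceded by ':' or '"'
// (as in [[Image:URL]] or href="URL") without consuming that character.
function MakeLinksLookbehind($source)
{
    $re = '!(?<![:"])((?:f|ht)tp://[-a-zA-Zа-яА-Я()0-9@:%_+.~#?&;/=]+)!iu';
    return preg_replace($re, '<a href="$1">$1</a>', $source);
}

// simpleWiki() as in the question, using a lazy (.*?) capture.
function simpleWiki($text)
{
    return preg_replace('/\[\[Image:(.*?)\]\]/', '<a href="$1"><img src="$1" /></a>', $text);
}

$text = 'http://example.com and [[Image:http://example.com/logo.png]]';
echo simpleWiki(MakeLinksLookbehind($text));
```

Here the bare URL at the start of the string becomes a link, the space after it is preserved, and the URL inside [[Image:...]] is left alone until simpleWiki() turns it into an image.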