I need a PHP regular expression that removes anything that is not an alphabetical letter or a digit (0 to 9), replaces spaces with hyphens, and converts the result to lowercase. There should be only one hyphen between words, never -- or --- etc.
For example:
Example: The quick brown fox jumped
Result: the-quick-brown-fox-jumped
Example: The quick brown fox jumped!
Result: the-quick-brown-fox-jumped
Example: The quick brown fox - jumped!
Result: the-quick-brown-fox-jumped
Example: The quick ~`!@#$%^ &*()_+= ------- brown {}|][ :"'; <>?.,/ fox - jumped!
Result: the-quick-brown-fox-jumped
Example: The quick 1234567890 ~`!@#$%^ &*()_+= ------- brown {}|][ :"'; <>?.,/ fox - jumped!
Result: the-quick-1234567890-brown-fox-jumped
Does anybody have an idea for the regular expression?
Thanks!
Since you seem to want every sequence of non-alphanumeric characters to be replaced by a single hyphen, you can use this:
$str = preg_replace('/[^a-zA-Z0-9]+/', '-', $str);
But this can result in leading or trailing hyphens, which can be removed with trim:
$str = trim($str, '-');
And to convert the result into lowercase, use strtolower:
$str = strtolower($str);
So all together:
$str = strtolower($str);
$str = preg_replace('/[^a-z0-9]+/', '-', $str);
$str = trim($str, '-');
Or in a compact one-liner:
$str = strtolower(trim(preg_replace('/[^a-zA-Z0-9]+/', '-', $str), '-'));
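As a quick check, the one-liner can be wrapped in a function and run against the examples from the question. This is only a sketch: the function name slugify is made up for illustration and is not part of PHP.

```php
<?php
// Hypothetical wrapper around the one-liner above.
function slugify(string $str): string
{
    // Collapse every run of non-alphanumeric characters into one hyphen,
    // strip hyphens from both ends, then lowercase the result.
    return strtolower(trim(preg_replace('/[^a-zA-Z0-9]+/', '-', $str), '-'));
}

echo slugify('The quick brown fox - jumped!'), "\n";
// the-quick-brown-fox-jumped

echo slugify('The quick 1234567890 ~`!@#$%^ &*()_+= ------- brown {}|][ :"\'; <>?.,/ fox - jumped!'), "\n";
// the-quick-1234567890-brown-fox-jumped
```

The trim has to come after the preg_replace, because the hyphens it removes are the ones the replacement creates at the ends of the string.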
I was just working on something similar and came up with this little piece of code. It also handles accented Latin characters.
This is the sample string:
$str = 'El veloz murciélago hindú comía fe<!>&@#$%&!"#%&?¡?*liz cardillo y kiwi. La cigüeña ¨^;.-|°¬tocaba el saxofón detrás del palenque de paja';
First I convert the string to HTML entities, which makes the accented characters easier to match in the next step.
$friendlyURL = htmlentities($str, ENT_COMPAT, "UTF-8", false);
Then I replace Latin characters with their corresponding ASCII characters (á becomes a, Ü becomes U, and so on):
$friendlyURL = preg_replace('/&([a-z]{1,2})(?:acute|circ|lig|grave|ring|tilde|uml|cedil|caron);/i','\1',$friendlyURL);
Then I convert the string back from html entities to symbols, again for easier use later.
$friendlyURL = html_entity_decode($friendlyURL,ENT_COMPAT, "UTF-8");
Next I replace all non-alphanumeric characters with hyphens.
$friendlyURL = preg_replace('/[^a-z0-9-]+/i', '-', $friendlyURL);
I remove extra hyphens inside the string:
$friendlyURL = preg_replace('/-+/', '-', $friendlyURL);
I remove leading and trailing hyphens:
$friendlyURL = trim($friendlyURL, '-');
And finally convert all into lowercase:
$friendlyURL = strtolower($friendlyURL);
All together:
function friendlyUrl ($str = '') {
    $friendlyURL = htmlentities($str, ENT_COMPAT, "UTF-8", false);
    $friendlyURL = preg_replace('/&([a-z]{1,2})(?:acute|circ|lig|grave|ring|tilde|uml|cedil|caron);/i','\1',$friendlyURL);
    $friendlyURL = html_entity_decode($friendlyURL,ENT_COMPAT, "UTF-8");
    $friendlyURL = preg_replace('/[^a-z0-9-]+/i', '-', $friendlyURL);
    $friendlyURL = preg_replace('/-+/', '-', $friendlyURL);
    $friendlyURL = trim($friendlyURL, '-');
    $friendlyURL = strtolower($friendlyURL);
    return $friendlyURL;
}
Test:
$str = 'El veloz murciélago hindú comía fe<!>&@#$%&!"#%&-?¡?*-liz cardillo y kiwi. La cigüeña ¨^`;.-|°¬tocaba el saxofón detrás del palenque de paja';
echo friendlyUrl($str);
Outcome:
el-veloz-murcielago-hindu-comia-fe-liz-cardillo-y-kiwi-la-ciguena-tocaba-el-saxofon-detras-del-palenque-de-paja
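To see what the entity round trip does on its own, here is a small sketch that isolates the first three steps. The helper name strip_accents_via_entities is made up for illustration, and the sketch assumes the input (and the source file) is UTF-8.

```php
<?php
// Hypothetical helper: only the accent-stripping part of friendlyUrl().
function strip_accents_via_entities(string $str): string
{
    // "é" becomes "&eacute;", "ü" becomes "&uuml;", and so on.
    $encoded = htmlentities($str, ENT_COMPAT, 'UTF-8', false);

    // Keep only the base letter(s) of each accent entity: "&eacute;" -> "e".
    $ascii = preg_replace(
        '/&([a-z]{1,2})(?:acute|circ|lig|grave|ring|tilde|uml|cedil|caron);/i',
        '\1',
        $encoded
    );

    // Turn any entities that did not match back into ordinary characters.
    return html_entity_decode($ascii, ENT_COMPAT, 'UTF-8');
}

echo strip_accents_via_entities('murciélago hindú'), "\n"; // murcielago hindu
echo strip_accents_via_entities('cigüeña'), "\n";          // ciguena
```

Characters whose entity name is not in the list (for example ø, which is &oslash;) are not converted by this step, so the later replacement turns them into hyphens.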
I guess Gumbo's answer fits your problem better, and its code is shorter, but I thought this would be useful for others.
Cheers,
Adriana
In a function:
function sanitize_text_for_urls ($str)
{
    return trim( strtolower( preg_replace(
        array('/[^a-z0-9-\s]/ui', '/\s/', '/-+/'),
        array('', '-', '-'),
        iconv('UTF-8', 'ASCII//TRANSLIT', $str) )), '-');
}
What it does:
// Solve accents and diacritics
$str = iconv('UTF-8', 'ASCII//TRANSLIT', $str);
// Leave only alphanumeric (respect existing hyphens)
$str = preg_replace('/[^a-z0-9-\s]/ui', '', $str);
// Turn spaces to hyphens
$str = preg_replace('/\s+/', '-', $str);
// Remove duplicate hyphens
$str = preg_replace('/-+/', '-', $str);
// Remove leading and trailing hyphens
$str = trim($str, '-');
// Turn to lowercase
$str = strtolower($str);
Note: You can combine multiple preg_replace calls by passing arrays of patterns and replacements. See the function at the top.
For example:
// Électricité, plâtrerie --> electricite-platrerie
// St. Lücie-Pétêrès --> st-lucie-peteres
// -Façade- & gros œuvre --> facade-gros-oeuvre
// _-Thè quîck ~`!@#&$%^ &*()_+= ---{}|][ :"; <>?.,/ fóx - jümpëd_-
//   --> the-quick-fox-jumped
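The array form of preg_replace applies each pattern in order to the result of the previous one, so it behaves like the separate calls listed above. Here is a minimal sketch of just that part. It leaves out the iconv step, because what ASCII//TRANSLIT produces can vary with the iconv implementation and locale, and so it assumes the input is already ASCII. The helper name collapse_to_slug is made up for illustration.

```php
<?php
// Hypothetical helper; assumes accents were already removed from $str.
function collapse_to_slug(string $str): string
{
    $patterns     = ['/[^a-z0-9\s-]/i', '/\s+/', '/-+/'];
    $replacements = ['',                '-',     '-'];

    // Patterns run in order: drop other characters, turn whitespace
    // into hyphens, then squeeze runs of hyphens into one.
    $str = preg_replace($patterns, $replacements, $str);

    return trim(strtolower($str), '-');
}

echo collapse_to_slug('St. Lucie-Peteres'), "\n";      // st-lucie-peteres
echo collapse_to_slug('-Facade- & gros oeuvre'), "\n"; // facade-gros-oeuvre
```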
EDIT: added "/u" at the end of the regex to use UTF8
EDIT: accounted for duplicated and leading/trailing hyphens, thanks to @LuBre
If you're using this for filenames in PHP, the answer by Gumbo would be
$str = preg_replace('/[^a-zA-Z0-9.]+/', '-', $str);
$str = trim($str, '-');
$str = strtolower($str);
Added a period for file names, and it's strtolower(), not strtolowercase().
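A short sketch of the filename variant in use. The function name sanitize_filename and the sample file names are made up for illustration.

```php
<?php
// Hypothetical wrapper around the three filename steps above.
function sanitize_filename(string $name): string
{
    // Same idea as before, but periods are kept so the extension survives.
    $name = preg_replace('/[^a-zA-Z0-9.]+/', '-', $name);
    $name = trim($name, '-');
    return strtolower($name);
}

echo sanitize_filename('Holiday Photo 01.JPG'), "\n";  // holiday-photo-01.jpg
echo sanitize_filename('My Report (final).PDF'), "\n"; // my-report-final-.pdf
```

As the second call shows, a hyphen can still end up right before the period, because trim only looks at the ends of the whole string.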
$str = preg_replace('/[^a-zA-Z0-9]/', '-', $str);
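Note that this pattern has no + quantifier, so every non-alphanumeric character gets its own hyphen instead of each run collapsing into one. A small sketch of the difference, with a sample input made up for illustration:

```php
<?php
$input = 'fox - jumped!';

// Without "+": one hyphen per replaced character.
$perChar = preg_replace('/[^a-zA-Z0-9]/', '-', $input);

// With "+": one hyphen per run of replaced characters.
$perRun = preg_replace('/[^a-zA-Z0-9]+/', '-', $input);

echo $perChar, "\n"; // fox---jumped-
echo $perRun, "\n";  // fox-jumped-
```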