I am trying to come up with a function that does a good job of sanitizing certain strings so that they are safe to use in the URL (like a post slug) and also safe to use as file names. For example, when someone uploads a file I want to make sure that I remove all dangerous characters from the name.
So far I have come up with the following function, which I hope solves this problem and also allows foreign UTF-8 data.
/**
 * Convert a string to the file/URL safe "slug" form
 *
 * @param string $string the string to clean
 * @param bool $is_filename TRUE will allow additional filename characters
 * @return string
 */
function sanitize($string = '', $is_filename = FALSE)
{
    // Replace all weird characters with dashes
    $string = preg_replace('/[^\w\-'. ($is_filename ? '~_\.' : ''). ']+/u', '-', $string);
    // Only allow one dash separator at a time (and make string lowercase)
    return mb_strtolower(preg_replace('/--+/u', '-', $string), 'UTF-8');
}
Does anyone have any tricky sample data I can run against this - or know of a better way to safeguard our apps from bad names?
$is_filename allows some additional characters, such as those found in temporary Vim file names.
Update: removed the star character since I could not think of a valid use.
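For anyone who wants to poke at it, here is a rough sketch of what the function above does with a few awkward inputs. The function body is copied from the question; the sample strings are made up for illustration and are not an exhaustive test set.

```php
<?php
// Copy of sanitize() from the question, behaviour unchanged.
function sanitize($string = '', $is_filename = FALSE)
{
    // Replace every run of disallowed characters with a single dash
    $string = preg_replace('/[^\w\-'. ($is_filename ? '~_\.' : ''). ']+/u', '-', $string);
    // Collapse repeated dashes and lowercase the result
    return mb_strtolower(preg_replace('/--+/u', '-', $string), 'UTF-8');
}

// Made-up sample inputs: [input, $is_filename]
$samples = [
    ['Hello World!', false],     // hello-world-      (trailing dash is kept)
    ['my.file.txt', false],      // my-file-txt       (dots are not allowed in slugs)
    ['../../etc/passwd', true],  // ..-..-etc-passwd  (slashes go, runs of dots stay)
    ['.htaccess', true],         // .htaccess         (passes through unchanged)
];

foreach ($samples as [$input, $isFilename]) {
    echo sanitize($input, $isFilename), PHP_EOL;
}
```

Two things stand out from this: leading and trailing dashes are not trimmed, and in filename mode names such as ".htaccess" or runs of dots survive, so the result may still need extra checks before it is used on disk. As far as I can tell, input that is not valid UTF-8 makes preg_replace() with the /u modifier return NULL, so such input would probably end up as an empty string; that case is worth checking separately.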
I don't think having a list of chars to remove is safe. I would rather use the following:
For filenames: Use an internal ID or a hash of the file content, and save the original document name in a database. This way you can keep the original filename and still find the file.
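The filename suggestion can be sketched roughly as follows. This is only an illustration under assumptions: storage_name_for() and remember_upload() are hypothetical helper names, SHA-256 is one possible choice of hash, and a plain PHP array stands in for the database table.

```php
<?php
// Hypothetical helper: derive the on-disk name from the file's content,
// never from the name the user supplied.
function storage_name_for(string $path): string
{
    $hash = hash_file('sha256', $path);
    if ($hash === false) {
        throw new RuntimeException("Cannot read $path");
    }
    return $hash; // 64 lowercase hex characters, safe on any filesystem
}

// Hypothetical helper: record the original name next to the storage name.
// $table is an in-memory stand-in for a real database table.
function remember_upload(array &$table, string $originalName, string $tmpPath): string
{
    $stored = storage_name_for($tmpPath);
    $table[$stored] = $originalName; // original name is kept only as data
    return $stored;
}

// Usage with a temporary file standing in for an upload
$uploads = [];
$tmp = tempnam(sys_get_temp_dir(), 'up');
file_put_contents($tmp, 'hello');
$stored = remember_upload($uploads, '../../evil name?.php', $tmp);
echo $stored, ' => ', $uploads[$stored], PHP_EOL;
unlink($tmp);
```

Because the stored name is derived only from the bytes of the file, nothing the uploader types ever reaches the filesystem path. One side effect of hashing content is that two uploads with identical content map to the same stored name, which may or may not be what you want; an internal auto-increment ID avoids that.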
For URL parameters: Use urlencode() to encode any special characters.
Here's CodeIgniter's implementation.
And the remove_invisible_characters dependency.
Try this:
Based on the selected answer in this thread: URL Friendly Username in PHP?
I've always thought Kohana did a pretty good job of it.
The handy UTF8::transliterate_to_ascii() will turn stuff like ñ => n.
Of course, you could replace the other UTF8::* stuff with mb_* functions.
This is a nice way to secure an upload filename:
This is the code used by PrestaShop to sanitize URLs:
is used by
to remove diacritics