Sanitizing strings to make them URL and filename safe

Posted 2019-01-03 11:53

I am trying to come up with a function that does a good job of sanitizing certain strings so that they are safe to use in the URL (like a post slug) and also safe to use as file names. For example, when someone uploads a file I want to make sure that I remove all dangerous characters from the name.

So far I have come up with the following function which I hope solves this problem and allows foreign UTF-8 data also.

/**
 * Convert a string to the file/URL safe "slug" form
 *
 * @param string $string the string to clean
 * @param bool $is_filename TRUE will allow additional filename characters
 * @return string
 */
function sanitize($string = '', $is_filename = FALSE)
{
 // Replace all weird characters with dashes
 $string = preg_replace('/[^\w\-'. ($is_filename ? '~_\.' : ''). ']+/u', '-', $string);

 // Only allow one dash separator at a time (and make string lowercase)
 return mb_strtolower(preg_replace('/--+/u', '-', $string), 'UTF-8');
}

Does anyone have any tricky sample data I can run against this - or know of a better way to safeguard our apps from bad names?

$is_filename allows some additional characters, such as those found in the names of temporary vim files.

Update: removed the star character since I could not think of a valid use for it.
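
A few inputs that tend to cause trouble can be run through the function as a quick check. The sketch below repeats sanitize() so that it runs on its own. The list of inputs is illustrative rather than exhaustive, and the mbstring extension is assumed to be available.

```php
<?php
// sanitize() as posted above, repeated so this snippet runs on its own.
function sanitize($string = '', $is_filename = FALSE)
{
    $string = preg_replace('/[^\w\-'. ($is_filename ? '~_\.' : ''). ']+/u', '-', $string);
    return mb_strtolower(preg_replace('/--+/u', '-', $string), 'UTF-8');
}

// Illustrative inputs (an assumption, not a standard suite): path traversal,
// dot names, an executable extension, repeated separators, the empty string.
$samples = array('..', '../../etc/passwd', '.htaccess', 'shell.PHP', 'My  File (1).TXT', '');

foreach ($samples as $sample) {
    printf("%-20s => %s\n", var_export($sample, TRUE), var_export(sanitize($sample, TRUE), TRUE));
}
```

With $is_filename set to TRUE, '..' and '.htaccess' pass through unchanged and 'shell.PHP' is only lowercased, so the function on its own does not make a name safe to write to disk.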

#2 · 2019-01-03 12:00

I don't think having a list of chars to remove is safe. I would rather use the following:

For filenames: use an internal ID or a hash of the file content as the name on disk, and save the original document name in a database. This way you can keep the original filename and still find the file.

For URL parameters: use urlencode() to encode any special characters.
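
A minimal sketch of both ideas, assuming a plain array stands in for the database table. The names store_upload() and download_url() and the /download.php path are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical helper: store a file under the SHA-256 hash of its content.
// $index stands in for a database table mapping hash => original name.
function store_upload(string $sourcePath, string $originalName, string $storageDir, array &$index): string
{
    $hash = hash_file('sha256', $sourcePath);
    if ($hash === false) {
        throw new RuntimeException('Cannot read ' . $sourcePath);
    }

    // The on-disk name is hex only, so the user-supplied name never reaches the filesystem.
    $target = $storageDir . DIRECTORY_SEPARATOR . $hash;
    if (!is_file($target) && !copy($sourcePath, $target)) {
        throw new RuntimeException('Cannot store ' . $sourcePath);
    }

    $index[$hash] = $originalName;
    return $hash;
}

// Hypothetical helper: build a link, encoding each parameter value with urlencode().
function download_url(string $baseUrl, string $hash, string $originalName): string
{
    return $baseUrl . '?id=' . urlencode($hash) . '&name=' . urlencode($originalName);
}

echo download_url('/download.php', 'abc123', 'my file?.txt'), "\n";
// /download.php?id=abc123&name=my+file%3F.txt
```

Because the stored name is derived from the content, two uploads with identical bytes share one file on disk; whether that is desirable depends on the application.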

#3 · 2019-01-03 12:01

Here's CodeIgniter's implementation.

/**
 * Sanitize Filename
 *
 * @param   string  $str        Input file name
 * @param   bool    $relative_path  Whether to preserve paths
 * @return  string
 */
public function sanitize_filename($str, $relative_path = FALSE)
{
    $bad = array(
        '../', '<!--', '-->', '<', '>',
        "'", '"', '&', '$', '#',
        '{', '}', '[', ']', '=',
        ';', '?', '%20', '%22',
        '%3c',      // <
        '%253c',    // <
        '%3e',      // >
        '%0e',      // >
        '%28',      // (
        '%29',      // )
        '%2528',    // (
        '%26',      // &
        '%24',      // $
        '%3f',      // ?
        '%3b',      // ;
        '%3d'       // =
    );

    if ( ! $relative_path)
    {
        $bad[] = './';
        $bad[] = '/';
    }

    $str = remove_invisible_characters($str, FALSE);
    return stripslashes(str_replace($bad, '', $str));
}

And the remove_invisible_characters dependency.

function remove_invisible_characters($str, $url_encoded = TRUE)
{
    $non_displayables = array();

    // every control character except newline (dec 10),
    // carriage return (dec 13) and horizontal tab (dec 09)
    if ($url_encoded)
    {
        $non_displayables[] = '/%0[0-8bcef]/';  // url encoded 00-08, 11, 12, 14, 15
        $non_displayables[] = '/%1[0-9a-f]/';   // url encoded 16-31
    }

    $non_displayables[] = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]+/S';   // 00-08, 11, 12, 14-31, 127

    // Repeat until a pass removes nothing, so sequences formed by a removal are caught too
    do
    {
        $str = preg_replace($non_displayables, '', $str, -1, $count);
    }
    while ($count);

    return $str;
}
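
One thing to watch with blacklist replacement such as the str_replace() call in sanitize_filename(): a single pass can leave a forbidden sequence behind when removing one match joins the surrounding characters into a new one. Here is a minimal sketch of the problem and of one common remedy, repeating the replacement until the string stops changing. strip_repeatedly() is a hypothetical name and not part of CodeIgniter.

```php
<?php
// One pass: removing the inner '../' joins the outer characters into a new '../'.
$once = str_replace('../', '', '....//secret.txt');
echo $once, "\n"; // ../secret.txt

// Hypothetical helper: repeat the replacement until nothing changes.
function strip_repeatedly(array $bad, string $str): string
{
    do {
        $before = $str;
        $str = str_replace($bad, '', $str);
    } while ($before !== $str);

    return $str;
}

echo strip_repeatedly(array('../'), '....//secret.txt'), "\n"; // secret.txt
```

This is the same loop-until-stable idea that remove_invisible_characters() uses with its do/while on $count.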

#4 · 2019-01-03 12:04

Try this:

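function normal_chars($string)
{
    // Encode accented letters as named entities, e.g. á becomes &aacute;
    $string = htmlentities($string, ENT_QUOTES, 'UTF-8');
    // Keep only the base letter(s) of each accent entity
    $string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', $string);
    $string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
    // Turn every remaining non-alphanumeric character into a space, then collapse runs
    $string = preg_replace(array('~[^0-9a-z]~i', '~[ -]+~'), ' ', $string);

    return trim($string, ' -');
}
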
echo normal_chars('Álix----_Ãxel!?!?'); // Alix Axel
echo normal_chars('áéíóúÁÉÍÓÚ'); // aeiouAEIOU
echo normal_chars('üÿÄËÏÖÜŸåÅ'); // uyAEIOUYaA

Based on the selected answer in this thread: URL Friendly Username in PHP?

#5 · 2019-01-03 12:05

I've always thought Kohana did a pretty good job of it.

public static function title($title, $separator = '-', $ascii_only = FALSE)
{
    if ($ascii_only === TRUE)
    {
        // Transliterate non-ASCII characters
        $title = UTF8::transliterate_to_ascii($title);

        // Remove all characters that are not the separator, a-z, 0-9, or whitespace
        $title = preg_replace('![^'.preg_quote($separator).'a-z0-9\s]+!', '', strtolower($title));
    }
    else
    {
        // Remove all characters that are not the separator, letters, numbers, or whitespace
        $title = preg_replace('![^'.preg_quote($separator).'\pL\pN\s]+!u', '', UTF8::strtolower($title));
    }

    // Replace all separator characters and whitespace by a single separator
    $title = preg_replace('!['.preg_quote($separator).'\s]+!u', $separator, $title);

    // Trim separators from the beginning and end
    return trim($title, $separator);
}

The handy UTF8::transliterate_to_ascii() will turn stuff like ñ => n.

Of course, you could replace the other UTF8::* stuff with mb_* functions.
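
As a rough illustration of that swap, here is a standalone sketch of the non-ASCII branch with mb_strtolower() in place of UTF8::strtolower(). The name url_title() is hypothetical, the transliteration branch is left out, and the mbstring extension is assumed to be available.

```php
<?php
// Standalone sketch of the slug logic using mb_strtolower().
// url_title() is a hypothetical name, not part of Kohana.
function url_title(string $title, string $separator = '-'): string
{
    // Quote the separator for use inside a character class ('!' is the delimiter)
    $sep = preg_quote($separator, '!');

    // Remove everything that is not the separator, a letter, a number, or whitespace
    $title = preg_replace('![^' . $sep . '\pL\pN\s]+!u', '', mb_strtolower($title, 'UTF-8'));

    // Replace runs of separators and whitespace with a single separator
    $title = preg_replace('![' . $sep . '\s]+!u', $separator, $title);

    // Trim separators from the beginning and end
    return trim($title, $separator);
}

echo url_title('  Hello,   World! -- Ñandú  '), "\n"; // hello-world-ñandú
```

Non-ASCII letters survive in this variant, so the resulting slug still has to be percent-encoded when it is placed in a URL.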

#6 · 2019-01-03 12:05

This is a nice way to secure an upload filename:

$file_name = trim(basename(stripslashes($name)), ".\x00..\x20");
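
To make the individual steps easier to follow, the one-liner can be unpacked as below. This is only a sketch, and safe_upload_name() is a hypothetical wrapper name.

```php
<?php
// Hypothetical wrapper around the one-liner, to show what each step does.
function safe_upload_name(string $name): string
{
    $name = stripslashes($name);   // drop backslash escapes
    $name = basename($name);       // keep only the last path component

    // Trim dots plus every byte from NUL (\x00) through space (\x20) off both ends.
    return trim($name, ".\x00..\x20");
}

echo safe_upload_name('../../etc/passwd'), "\n"; // passwd
echo safe_upload_name('.htaccess'), "\n";        // htaccess
echo safe_upload_name(' report.pdf. '), "\n";    // report.pdf
```

A name consisting only of dots comes back empty, so the caller needs a fallback name for that case.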

#7 · 2019-01-03 12:08

This is the code PrestaShop uses to sanitize URLs.

replaceAccentedChars() is used by str2url() to remove diacritics:

function replaceAccentedChars($str)
{
    $patterns = array(
        /* Lowercase */

        /* Uppercase */

    );

    $replacements = array(
            'a', 'c', 'd', 'e', 'i', 'l', 'n', 'o', 'r', 's', 'ss', 't', 'u', 'y', 'z', 'ae', 'oe',
            'A', 'C', 'D', 'E', 'L', 'N', 'O', 'R', 'S', 'T', 'U', 'Z', 'AE', 'OE'
    );

    return preg_replace($patterns, $replacements, $str);
}

function str2url($str)
{
    if (function_exists('mb_strtolower'))
        $str = mb_strtolower($str, 'utf-8');

    $str = trim($str);
    if (!function_exists('mb_strtolower'))
        $str = replaceAccentedChars($str);

    // Remove all non-whitelist chars.
    // (The hyphen is escaped so it is not parsed as the start of a range before \pL.)
    $str = preg_replace('/[^a-zA-Z0-9\s\'\:\/\[\]\-\pL]/u', '', $str);
    $str = preg_replace('/[\s\'\:\/\[\]-]+/', ' ', $str);
    $str = str_replace(array(' ', '/'), '-', $str);

    // If it was not possible to lowercase the string with mb_strtolower, we do it after the transformations.
    // This way we lose fewer special chars.
    if (!function_exists('mb_strtolower'))
        $str = strtolower($str);

    return $str;
}