I've found that the PHP functions basename() and pathinfo() behave strangely with multibyte UTF-8 names.
They strip all non-Latin characters up to the first Latin character or punctuation mark. Non-Latin characters after that point, however, are preserved.
basename("àxà"); // returns "xà", I would expect "àxà" or just "x" instead
pathinfo("àyà/àxà", PATHINFO_BASENAME); // returns "xà", same as above
Curiously, the dirname part of pathinfo() works fine:
pathinfo("àyà/àxà", PATHINFO_DIRNAME); // returns "àyà"
The PHP documentation warns that basename() and pathinfo() are locale aware, but that does not explain the inconsistency between pathinfo(..., PATHINFO_BASENAME) and pathinfo(..., PATHINFO_DIRNAME), let alone the fact that identical non-Latin characters are either discarded or kept depending on their position relative to Latin characters.
It looks like a PHP bug.
Since basename checks matter for security (they help prevent directory traversal), is there a reliable basename filter that works properly with Unicode input?
I've found that changing the locale fixes the problem.
While Apache runs with the "C" locale by default, CLI scripts run by default with a UTF-8 locale such as "en_US.UTF-8" (or, in my case, "it_IT.UTF-8"). Under a UTF-8 locale the problem does not occur.
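To see which locale a given environment is actually running under, the current setting can be queried without changing it. A minimal sketch (the variable name is illustrative; run it both from the CLI and through the web server to compare):

```php
<?php
// Passing the string "0" makes setlocale() return the current setting
// for the category instead of changing it.
$current = setlocale(LC_CTYPE, '0');
echo "LC_CTYPE is: ", $current, PHP_EOL;

// "C" is the locale under which the problem described above shows up.
if ($current === 'C' || $current === 'POSIX') {
    echo "Not a UTF-8 locale: multibyte names may be mangled by basename().", PHP_EOL;
}
```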
The workaround on Apache is therefore to change the locale from "C" to "C.UTF-8" before calling these functions.
setlocale(LC_ALL, 'C.UTF-8');
basename("àxà"); // now returns "àxà", which is correct
pathinfo("àyà/àxà", PATHINFO_BASENAME); // now returns "àxà", which is correct
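One caveat: setlocale() returns false when the requested locale is not installed, and "C.UTF-8" may not exist on every system. Below is a minimal sketch that tries a few candidates and reports failure. The helper name applyUtf8Ctype() and the candidate list are assumptions, and it changes only LC_CTYPE rather than LC_ALL as above, so that number and date formatting stay untouched:

```php
<?php
/**
 * Hypothetical helper: try several UTF-8 locale names for LC_CTYPE.
 * Returns the name of the locale that was applied, or false if none is installed.
 */
function applyUtf8Ctype(array $candidates)
{
    // When given an array, setlocale() applies the first locale that is available.
    return setlocale(LC_CTYPE, $candidates);
}

// Candidate names are assumptions; adjust them to what `locale -a` lists on the server.
$applied = applyUtf8Ctype(['C.UTF-8', 'en_US.UTF-8', 'en_US.utf8']);

if ($applied === false) {
    // Nothing changed: basename() would still run under the previous locale.
    echo "No UTF-8 locale available", PHP_EOL;
} else {
    echo "Using locale: ", $applied, PHP_EOL;
}
```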
Or even better, if you want to back up the current locale and restore it once done:
$lc = new LocaleManager();
$lc->doBackup();
$lc->fixLocale();
basename("àxà/àyà");
$lc->doRestore();
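class LocaleManager
{
    /** @var array Saved locale per category name, e.g. "LC_CTYPE" => "it_IT.UTF-8" */
    private $backup = array();

    /** Remember the current locale so that doRestore() can bring it back. */
    public function doBackup()
    {
        $this->backup = array();
        // Passing "0" makes setlocale() return the current setting without changing it.
        $localeSettings = setlocale(LC_ALL, "0");
        if (strpos($localeSettings, ";") === false)
        {
            // All categories share the same locale, e.g. "C".
            $this->backup["LC_ALL"] = $localeSettings;
        }
        else
        {
            // If any of the locales differs, setlocale() returns all of them separated by semicolons.
            // E.g.: LC_CTYPE=it_IT.UTF-8;LC_NUMERIC=C;LC_TIME=C;...
            $locales = explode(";", $localeSettings);
            foreach ($locales as $locale)
            {
                // Split on the first "=" only, in case the value itself contains one.
                list ($key, $value) = explode("=", $locale, 2);
                $this->backup[$key] = $value;
            }
        }
    }

    /** Re-apply every category saved by doBackup(). */
    public function doRestore()
    {
        foreach ($this->backup as $key => $value)
        {
            // Skip category names that have no constant in this PHP build
            // (LC_MESSAGES, for instance, is not always defined).
            if (defined($key))
            {
                setlocale(constant($key), $value);
            }
        }
    }

    /** Switch to a UTF-8 locale so that basename() and pathinfo() handle multibyte names. */
    public function fixLocale()
    {
        setlocale(LC_ALL, "C.UTF-8");
    }
}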
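If changing the locale is not an option, a locale-independent alternative is to split the path at the byte level. In UTF-8 the bytes for "/" and "\" never occur inside a multibyte character, so plain string functions are safe here. This is only a sketch: safeBasename() is a hypothetical helper, not a PHP built-in, and it assumes the input is UTF-8.

```php
<?php
/**
 * Hypothetical helper: return the last path component using only byte-level
 * string functions, so the result does not depend on setlocale().
 */
function safeBasename(string $path): string
{
    // Treat backslashes as separators too (assumption: Windows-style input is possible).
    $path = str_replace('\\', '/', $path);
    // Drop trailing separators so that "a/b/" yields "b", as basename() does.
    $path = rtrim($path, '/');
    // Keep everything after the last separator, or the whole string if there is none.
    $pos = strrpos($path, '/');
    return $pos === false ? $path : substr($path, $pos + 1);
}

echo safeBasename("àyà/àxà"), PHP_EOL;          // prints "àxà"
echo safeBasename("../../etc/passwd"), PHP_EOL; // prints "passwd"
```

Note that it still returns "." or ".." when given exactly those, so reject them separately when guarding against directory traversal.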