What is the best way to handle/remove UTF-8's right-to-left override character (U+202E)?

Posted 2019-08-07 07:22

Question:

There is a UTF-8 character (hex bytes E2 80 AE, i.e. U+202E RIGHT-TO-LEFT OVERRIDE) that, when handled correctly by UTF-8-enabled systems, makes the characters that follow it appear visually reversed when displayed to the user. It is commonly used by attackers to hide or tamper with file extensions.
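To make the byte sequence concrete, here is a minimal PHP sketch that builds such a filename from explicit byte escapes instead of pasting the invisible character. The variable names are illustrative and not from the original post, and the mbstring extension is assumed to be loaded.

```php
<?php
// U+202E RIGHT-TO-LEFT OVERRIDE written out as its three UTF-8 bytes.
$rlo = "\xE2\x80\xAE";

// Same shape as the example filename in the question, built explicitly.
$name = "EvilFile" . $rlo . ".EXE";

echo bin2hex($rlo), "\n";              // e280ae
echo strlen($name), "\n";              // 15 bytes: 8 + 3 + 4
echo mb_strlen($name, 'UTF-8'), "\n";  // 13 characters: 8 + 1 + 4
```

Writing the character as an escape sequence keeps it visible in source code, which makes test cases easier to review.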

Here are examples of such filename strings:

an .EXE called: EvilFile‮.EXE

an .scr called: yo.na‮.scr

Filename extension validation is not the problem, provided it is done. The problem is displaying such a string: htmlentities() causes the string to become: EvilFile�.EXE

So, what would be the best solution to fix the filename back to EvilFile.EXE?

Tests I've done with iconv produce the same kind of encoding problems on output.

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8"> 
    <title></title>
</head>

<body>
<?php
$evilString = "EvilFile‮.EXE";
$ret = ''; // accumulates the HTML output

$ret .= '<h1>htmlentities/ENT_QUOTES | ENT_IGNORE</h1>';
$ret .= htmlentities($evilString, ENT_QUOTES | ENT_IGNORE, "UTF-8").'<br>';

// Encodings to test, used as both source and target for iconv()
$enc = array(
    "UTF-8", 
    "ASCII", 
    "Windows-1252", 
    "ISO-8859-15", 
    "ISO-8859-1", 
    "ISO-8859-6", 
    "CP1256",
    "US-ASCII//TRANSLIT", 
    "UTF-8//IGNORE",
    "UTF-8//TRANSLIT"
 );

//iconv
foreach ($enc as $i) {
    $ret .= '<h1>iconv/'.$i.'</h1>';
    foreach ($enc as $j) {
        $ret .= " $i - $j: ".@iconv($i, $j, $evilString).'<br>';
    }
}

//mb_convert_encoding
$ret .= '<h1>mb_convert_encoding</h1>';
foreach (mb_list_encodings() as $chr) {
    $ret .= $chr.' - '.mb_convert_encoding($evilString, 'UTF-8', $chr)."<br>";   
} 

echo $ret;
?> 
</body>
</html>

Result

iconv/US-ASCII//TRANSLIT
------------------------
US-ASCII//TRANSLIT - UTF-8: EvilFile
US-ASCII//TRANSLIT - ASCII: EvilFile
US-ASCII//TRANSLIT - Windows-1252: EvilFile
US-ASCII//TRANSLIT - ISO-8859-15: EvilFile
US-ASCII//TRANSLIT - ISO-8859-1: EvilFile
US-ASCII//TRANSLIT - ISO-8859-6: EvilFile
US-ASCII//TRANSLIT - CP1256: EvilFile
US-ASCII//TRANSLIT - US-ASCII//TRANSLIT: EvilFile
US-ASCII//TRANSLIT - UTF-8//IGNORE: EvilFile.EXE <<< - See answer below
US-ASCII//TRANSLIT - UTF-8//TRANSLIT: EvilFile

iconv/UTF-8//IGNORE
-------------------
UTF-8//IGNORE - UTF-8: EvilFile‮.EXE
UTF-8//IGNORE - ASCII: EvilFile
UTF-8//IGNORE - Windows-1252: EvilFile
UTF-8//IGNORE - ISO-8859-15: EvilFile
UTF-8//IGNORE - ISO-8859-1: EvilFile
UTF-8//IGNORE - ISO-8859-6: EvilFile
UTF-8//IGNORE - CP1256: EvilFile
UTF-8//IGNORE - US-ASCII//TRANSLIT: EvilFile
UTF-8//IGNORE - UTF-8//IGNORE: EvilFile‮.EXE
UTF-8//IGNORE - UTF-8//TRANSLIT: EvilFile‮.EXE

iconv/UTF-8//TRANSLIT
---------------------
UTF-8//TRANSLIT - UTF-8: EvilFile‮.EXE
UTF-8//TRANSLIT - ASCII: EvilFile
UTF-8//TRANSLIT - Windows-1252: EvilFile
UTF-8//TRANSLIT - ISO-8859-15: EvilFile
UTF-8//TRANSLIT - ISO-8859-1: EvilFile
UTF-8//TRANSLIT - ISO-8859-6: EvilFile
UTF-8//TRANSLIT - CP1256: EvilFile
UTF-8//TRANSLIT - US-ASCII//TRANSLIT: EvilFile
UTF-8//TRANSLIT - UTF-8//IGNORE: EvilFile‮.EXE
UTF-8//TRANSLIT - UTF-8//TRANSLIT: EvilFile‮.EXE

mb_convert_encoding
-------------------
pass - EvilFileâ®.EXE
auto - EvilFile‮.EXE
wchar - EvilFileâ®.EXE
byte2be - 䕶楬䙩汥긮䕘
byte2le - 癅汩楆敬胢⺮塅
byte4be - ������������?
byte4le - ������������������
BASE64 - ��)^q
UUENCODE -
HTML-ENTITIES - EvilFileâ®.EXE
Quoted-Printable - EvilFile‮.EXE
7bit - EvilFileâ®.EXE
8bit - EvilFileâ®.EXE
UCS-4 - ������������?
UCS-4BE - ������������?
UCS-4LE - ������������������
UCS-2 - 䕶楬䙩汥긮䕘
UCS-2BE - 䕶楬䙩汥긮䕘
UCS-2LE - 癅汩楆敬胢⺮塅
UTF-32 - ?
UTF-32BE - ?
UTF-32LE -
UTF-16 - 䕶楬䙩汥긮䕘
UTF-16BE - 䕶楬䙩汥긮䕘
UTF-16LE - 癅汩楆敬胢⺮塅
UTF-8 - EvilFile‮.EXE
UTF-7 - EvilFile???.EXE
UTF7-IMAP - EvilFile???.EXE
ASCII - EvilFileâ®.EXE
EUC-JP - EvilFile??EXE
SJIS - EvilFile窶ョ.EXE
eucJP-win - EvilFile??EXE
SJIS-win - EvilFile窶ョ.EXE
CP932 - EvilFile窶ョ.EXE
CP51932 - EvilFile??EXE
JIS - EvilFile??ョ.EXE
ISO-2022-JP - EvilFile??ョ.EXE
ISO-2022-JP-MS - EvilFile??ョ.EXE
Windows-1252 - EvilFile‮.EXE
Windows-1254 - EvilFile‮.EXE
ISO-8859-1 - EvilFileâ®.EXE
ISO-8859-2 - EvilFileâŽ.EXE
ISO-8859-3 - EvilFileâ?.EXE
ISO-8859-4 - EvilFileâŽ.EXE
ISO-8859-5 - EvilFileтЎ.EXE
ISO-8859-6 - EvilFileق?.EXE
ISO-8859-7 - EvilFileβ?.EXE
ISO-8859-8 - EvilFileג®.EXE
ISO-8859-9 - EvilFileâ®.EXE
ISO-8859-10 - EvilFileâŪ.EXE
ISO-8859-13 - EvilFileā®.EXE
ISO-8859-14 - EvilFileâ®.EXE
ISO-8859-15 - EvilFileâ®.EXE
ISO-8859-16 - EvilFileâ®.EXE
EUC-CN - EvilFile??EXE
CP936 - EvilFile鈥?EXE
HZ - EvilFile???.EXE
EUC-TW - EvilFile??EXE
BIG-5 - EvilFile??EXE
EUC-KR - EvilFile??EXE
UHC - EvilFile巽?EXE
ISO-2022-KR - EvilFile???.EXE
Windows-1251 - EvilFile‮.EXE
CP866 - EvilFileтАо.EXE
KOI8-R - EvilFileБ─╝.EXE
KOI8-U - EvilFileБ─╝.EXE
ArmSCII-8 - EvilFileՉ….EXE
CP850 - EvilFileÔÇ«.EXE
JIS-ms - EvilFile??ョ.EXE
CP50220 - EvilFile??ョ.EXE
CP50220raw - EvilFile??ョ.EXE
CP50221 - EvilFile??ョ.EXE
CP50222 - EvilFile??ョ.EXE

I suppose there is one option, which I'm not keen on: pass the string through utf8_encode() and then through preg_replace() to remove the offending characters. But there must be a better, cleaner way.

echo preg_replace('/[^a-z0-9_ \[\]\.\(\)#%&-]/si', '', utf8_encode($evilString)).'<br>';
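As a side note, the same whitelist appears to work without utf8_encode(): without the /u modifier, preg_replace() matches byte by byte, and none of the bytes E2 80 AE is in the allowed set. The sketch below assumes the whitelist from the line above; the function name is hypothetical. Keep in mind that utf8_encode() is deprecated as of PHP 8.2, and that a whitelist like this also strips every legitimate non-ASCII character.

```php
<?php
// Hypothetical helper: keep only an ASCII whitelist of filename characters.
function whitelistFilename(string $name): string
{
    // No /u modifier, so the pattern is applied to individual bytes and
    // each of the three bytes of U+202E is removed on its own.
    return (string) preg_replace('/[^a-z0-9_ \[\]\.\(\)#%&-]/i', '', $name);
}

echo whitelistFilename("EvilFile\xE2\x80\xAE.EXE"), "\n"; // EvilFile.EXE
echo whitelistFilename("yo.na\xE2\x80\xAE.scr"), "\n";    // yo.na.scr
```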

Answer 1:

After some further tests I added the US-ASCII//TRANSLIT to UTF-8//IGNORE combination (see the results above). So, to fix these kinds of strings without using a regex, you would use:

echo iconv('US-ASCII//TRANSLIT', 'UTF-8//IGNORE', $evilString); //EvilFile.EXE

Hope this helps anyone in the future with this unique problem.
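Because the source encoding in that call is declared as US-ASCII, it seems to work by discarding bytes outside ASCII, so it would presumably also alter legitimate non-ASCII filenames. If that matters, a more targeted alternative is to remove only the Unicode bidirectional control characters. The following is a minimal sketch of that idea, not the answer's method: the function name is hypothetical, the character list covers the common bidi controls and may not be exhaustive, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: remove only Unicode bidirectional control characters.
function stripBidiControls(string $name): string
{
    // U+200E/U+200F: directional marks
    // U+202A-U+202E: embeddings and overrides (U+202E is the one in question)
    // U+2066-U+2069: isolates
    $pattern = '/[\x{200E}\x{200F}\x{202A}-\x{202E}\x{2066}-\x{2069}]/u';

    // With the /u modifier, preg_replace() returns null for invalid UTF-8.
    $clean = preg_replace($pattern, '', $name);
    if ($clean === null) {
        throw new InvalidArgumentException('Filename is not valid UTF-8.');
    }
    return $clean;
}

echo stripBidiControls("EvilFile\xE2\x80\xAE.EXE"), "\n"; // EvilFile.EXE
echo stripBidiControls("caf\xC3\xA9.txt"), "\n";          // café.txt
```

Unlike the whitelist or the iconv() call, this leaves other non-ASCII characters in place, which may or may not be what you want for filenames.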