Given a block of text that's known to be Chinese and encoded in UTF-8, is there a way to determine if it's Simplified or Traditional?
Since big5 and gb2312 omit quite a few commonly used variants that are present in Unicode, code that relies on an exact match between the translit and ignore modes would fail in quite a lot of normal use cases: it would fail to identify 説話 as Traditional Chinese, even though 説 is a common Hong Kong variant of 說, the form that big5 actually contains. A simple fix is to make the comparison fuzzy instead of exact.
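One possible reading of a fuzzy comparison, sketched in Python rather than with iconv: count how many Han characters each legacy charset cannot represent and accept the better fit as long as its miss rate stays under a threshold. The names `miss_ratio` and `guess_script`, the 0.2 tolerance, and the restriction to the basic CJK block (U+4E00 to U+9FFF) are all assumptions made for illustration, not part of the original answer.

```python
def miss_ratio(chars, encoding):
    """Fraction of characters that the target charset cannot represent."""
    missing = 0
    for ch in chars:
        try:
            ch.encode(encoding)  # strict mode raises on an unmappable character
        except UnicodeEncodeError:
            missing += 1
    return missing / len(chars)


def guess_script(text, tolerance=0.2):
    """Hypothetical helper: guess 'simplified' or 'traditional' with some slack."""
    # Only look at Han characters in the basic CJK block (an assumption).
    han = [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]
    if not han:
        return "unknown"
    simp_miss = miss_ratio(han, "gb2312")
    trad_miss = miss_ratio(han, "big5")
    if simp_miss == trad_miss:
        # Text made only of characters shared by both charsets, or equally bad fits.
        return "ambiguous"
    best, miss = min(
        (("simplified", simp_miss), ("traditional", trad_miss)),
        key=lambda pair: pair[1],
    )
    # Tolerate a few variants missing from the charset, but not too many.
    return best if miss <= tolerance else "unknown"
```

With this kind of slack, a longer Traditional passage containing an occasional variant such as 説 can still be classified as Traditional. Very short strings remain unreliable, because a single unmappable character is a large share of the text.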
I don't know if this will work, but I'd try using iconv to see whether the text converts cleanly from UTF-8 to each candidate charset (big5 for Traditional, gb2312 for Simplified), comparing the results of the same conversion with //TRANSLIT and with //IGNORE. If the two results match, the conversion hasn't hit any characters that fail to translate, so the text most likely belongs to that charset.
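The same exact-match idea can be sketched in Python. Python's codecs have no //TRANSLIT mode, so this swaps in strict encoding, which raises an error on the first character the target charset cannot represent. That plays the role of "the two conversions differ". The function names `fits` and `classify_exact` are hypothetical.

```python
def fits(text, encoding):
    """Return True if every character of text can be encoded in the charset."""
    try:
        text.encode(encoding)  # strict mode: raises on the first unmappable character
    except UnicodeEncodeError:
        return False
    return True


def classify_exact(text):
    """Hypothetical helper: classify only when exactly one charset fits."""
    simplified = fits(text, "gb2312")
    traditional = fits(text, "big5")
    if simplified and not traditional:
        return "simplified"
    if traditional and not simplified:
        return "traditional"
    # Either both charsets fit (shared characters only) or neither does.
    return "undetermined"
```

Text made up only of characters common to both charsets, such as 中文, fits both and cannot be decided this way.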