php find emoji [update existing code]

Posted 2019-03-21 04:25

Question:

I'm trying to detect emoji in my PHP code and prevent users from entering them.

The code I have is:

if(preg_match('/\xEE[\x80-\xBF][\x80-\xBF]|\xEF[\x81-\x83][\x80-\xBF]/', $value) > 0)
{
    //warning...
}

But it doesn't work for all emoji. Any ideas?

Answer 1:

if(preg_match('/\xEE[\x80-\xBF][\x80-\xBF]|\xEF[\x81-\x83][\x80-\xBF]/', $value) 

You really want to match Unicode at a character level, rather than trying to keep track of UTF-8 byte sequences. Use the u modifier to treat your UTF-8 string on a character basis.
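To illustrate the difference the u modifier makes, here is a minimal sketch. The variable names are made up for the example, and the globe character U+1F30F is written out as its four UTF-8 bytes so the result does not depend on the file encoding.

```php
<?php
// U+1F30F (globe) as its four UTF-8 bytes.
$globe = "\xF0\x9F\x8C\x8F";

// Without the u modifier, "." matches one byte at a time: four matches.
$bytes = preg_match_all('/./s', $globe);

// With the u modifier, "." matches a whole code point: one match.
$chars = preg_match_all('/./su', $globe);

echo "$bytes bytes, $chars character\n"; // prints "4 bytes, 1 character"
```

Once the pattern works on characters, a range such as a whole Unicode block can be written directly in a character class instead of as byte sequences.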

Many of the emoji are encoded in the block U+1F300–U+1F5FF (Miscellaneous Symbols and Pictographs), with others in neighbouring blocks. However:

  • many characters from Japanese carriers' ‘emoji’ sets are actually mapped to existing Unicode symbols, e.g. the card suits, zodiac signs and some arrows. Do you count these symbols as ‘emoji’ now?

  • there are still systems which don't use the newly-standardised Unicode emoji code points, instead using ad-hoc ranges in the Private Use Area. Each carrier had its own encoding. iOS 4 used the Softbank set. You may wish to block the entire Private Use Area.

For example:

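// Convert a Unicode code point to its UTF-8 encoded character.
function unichr($i) {
    return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}

// Match anything in U+1F300–U+1F5FF or in the Private Use Area (U+E000–U+F8FF).
if (preg_match('/['.
    unichr(0x1F300).'-'.unichr(0x1F5FF).
    unichr(0xE000).'-'.unichr(0xF8FF).
']/u', $value)) {
    // ...
}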


Answer 2:

From Wikipedia:

The core emoji set as of Unicode 6.0 consists of 722 characters, of which 114 characters map to sequences of one or more characters in the pre-6.0 Unicode standard, and the remaining 608 characters map to sequences of one or more characters introduced in Unicode 6.0.[4] There is no block specifically set aside for emoji – the new symbols were encoded in seven different blocks (some newly created), and there exists a Unicode data file called EmojiSources.txt that includes mappings to and from the Japanese vendors' legacy character sets.

Here is the mapping file. There are 722 lines in the file, each one representing one of the 722 emoticons.

This is not an easy thing to do, because there is no single block set aside for emoji. You need to adjust your regex to cover all of the emoji code points.

You could match an individual code point like so:

\x{1F30F}

U+1F30F is the code point for a globe emoji.
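Extending the single-code-point match above to whole ranges might look like the sketch below. The function name contains_emoji is hypothetical, and the list of blocks is an assumption chosen for illustration; it is not a complete definition of emoji. Note that \x{...} with values this large needs the u modifier in PHP.

```php
<?php
// Hypothetical helper: true if $value contains a code point from a few
// emoji-heavy Unicode blocks. The block list is illustrative, not exhaustive.
function contains_emoji($value) {
    $pattern = '/['
        . '\x{1F300}-\x{1F5FF}'  // Miscellaneous Symbols and Pictographs
        . '\x{1F600}-\x{1F64F}'  // Emoticons
        . '\x{1F680}-\x{1F6FF}'  // Transport and Map Symbols
        . '\x{2600}-\x{26FF}'    // Miscellaneous Symbols
        . '\x{2700}-\x{27BF}'    // Dingbats
        . ']/u';
    return preg_match($pattern, $value) === 1;
}

var_dump(contains_emoji("hello \xF0\x9F\x8C\x8F")); // globe U+1F30F: bool(true)
var_dump(contains_emoji("hello world"));            // bool(false)
```

Keeping one range per line with a comment makes it easier to add or drop blocks as your definition of "emoji" changes.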

Sorry I don't have a full answer for you, but this should get you headed in the right direction.



Answer 3:

The right answer is to detect where you have an assigned code point in the Miscellaneous_Symbols_And_Pictographs block. In Perl, you’d use

 /\p{Assigned}/ && /\p{block=Miscellaneous_Symbols_And_Pictographs}/

or just

/\P{Cn}/ && /\p{Miscellaneous_Symbols_And_Pictographs}/

which you should combine into one pattern with

/(?=\p{Assigned})\p{Miscellaneous_Symbols_And_Pictographs}/

I don’t recall whether the PCRE library that PHP uses gives you access to the requisite Unicode character properties. My recollection is that it’s pretty weak in that particular area. I think you only have Unicode script properties and general categories. Sigh.

Sometimes you just have to use the real thing.

For lack of decent Unicode support, you may have to enumerate the block yourself:

/(?=\P{Cn})[\x{1F300}-\x{1F5FF}]/u

Looks like a maintenance nightmare to me, full of magic numbers.
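One way to soften the magic-number problem in PHP is to give the endpoints names. This is only a sketch: the constant and function names are hypothetical, and the range is the one quoted above.

```php
<?php
// Hypothetical names for the endpoints of the block range quoted above.
const PICTOGRAPHS_FIRST = 0x1F300;
const PICTOGRAPHS_LAST  = 0x1F5FF;

// True if $value contains an assigned code point inside the named range.
function has_assigned_pictograph($value) {
    // (?=\P{Cn}) requires the next character to be assigned (not category Cn);
    // the u modifier makes PCRE treat pattern and subject as UTF-8.
    $pattern = sprintf(
        '/(?=\P{Cn})[\x{%X}-\x{%X}]/u',
        PICTOGRAPHS_FIRST,
        PICTOGRAPHS_LAST
    );
    return preg_match($pattern, $value) === 1;
}

var_dump(has_assigned_pictograph("\xF0\x9F\x8C\x8F")); // globe U+1F30F
var_dump(has_assigned_pictograph("abc"));
```

Which code points count as assigned depends on the Unicode version built into the PCRE library, so results for recently added characters can vary between PHP installations.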



Answer 4:

That's what I came up with today. It's probably not a good solution for this problem, but at least it works ;)

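// Round-trip through Windows-1250: any character outside that code page (emoji included) does not survive.
if(iconv('Windows-1250', 'UTF-8', iconv('UTF-8', 'Windows-1250', $value)) != $value) { /* warning... */ }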