Question:

I'm trying to detect emoji in my PHP code, and prevent users from entering them.

The code I have is:

if(preg_match('/\xEE[\x80-\xBF][\x80-\xBF]|\xEF[\x81-\x83][\x80-\xBF]/', $value) > 0)
{
    //warning...
}

But it doesn't work for all emoji. Any ideas?

Answer 1:

if(preg_match('/\xEE[\x80-\xBF][\x80-\xBF]|\xEF[\x81-\x83][\x80-\xBF]/', $value)

You really want to match Unicode at a character level, rather than trying to keep track of UTF-8 byte sequences. Use the u modifier to treat your UTF-8 string on a character basis.
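
To illustrate the difference the u modifier makes, here is a minimal sketch. The test string is U+1F600 written out as raw UTF-8 bytes, and the variable names are only for illustration:

```php
<?php
// U+1F600 (GRINNING FACE) as raw UTF-8 bytes: F0 9F 98 80.
$smiley = "\xF0\x9F\x98\x80";

// Without /u the subject is treated as four separate bytes,
// so "exactly one character" does not match.
$bytewise = preg_match('/^.$/', $smiley);

// With /u the same four bytes are decoded as a single character.
$charwise = preg_match('/^.$/u', $smiley);

// The byte-level pattern from the question only describes three-byte
// sequences starting with EE or EF, so a four-byte emoji slips through.
$questionPattern = '/\xEE[\x80-\xBF][\x80-\xBF]|\xEF[\x81-\x83][\x80-\xBF]/';
$missed = preg_match($questionPattern, $smiley);

var_dump($bytewise, $charwise, $missed);
```

Here `$bytewise` and `$missed` come out as 0 and `$charwise` as 1, which is why the byte-sequence approach misses characters outside the Basic Multilingual Plane.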

The emoji are encoded in the block U+1F300–U+1F5FF. However:

- Many characters from Japanese carriers' ‘emoji’ sets are actually mapped to existing Unicode symbols, e.g. the card suits, zodiac signs and some arrows. Do you count these symbols as ‘emoji’ now?
- There are still systems which don't use the newly standardised Unicode emoji code points, instead using ad-hoc ranges in the Private Use Area. Each carrier had its own encoding; iOS 4 used the Softbank set. You may wish to block the entire Private Use Area.

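For example:

// Return the UTF-8 encoding of the Unicode code point $i.
function unichr($i) {
    return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}

// Character class covering U+1F300–U+1F5FF and the Private Use Area (U+E000–U+F8FF).
if (preg_match('/['.
    unichr(0x1F300).'-'.unichr(0x1F5FF).
    unichr(0xE000).'-'.unichr(0xF8FF).
']/u', $value)) {
    // ...
}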

Answer 2:

From Wikipedia:

The core emoji set as of Unicode 6.0 consists of 722 characters, of which 114 characters map to sequences of one or more characters in the pre-6.0 Unicode standard, and the remaining 608 characters map to sequences of one or more characters introduced in Unicode 6.0.[4] There is no block specifically set aside for emoji – the new symbols were encoded in seven different blocks (some newly created), and there exists a Unicode data file called EmojiSources.txt that includes mappings to and from the Japanese vendors' legacy character sets.

Here is the mapping file. There are 722 lines in the file, each one representing one of the 722 emoticons.

It seems like this is not an easy thing to do, because there is no specific block set aside for emoji. You need to adjust your regex to cover all of the emoji code points.

You could match an individual code point like so (in PHP the pattern needs the u modifier for this escape to work):

\x{1F30F}

U+1F30F is the code point for a globe emoji (EARTH GLOBE ASIA-AUSTRALIA).
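
Extending that idea to several ranges at once might look like the sketch below. It assumes PHP 7 or later. The function name `containsEmojiLike` is made up for this example, and the list of blocks is only a partial, illustrative selection, not the full set from EmojiSources.txt. Note that it also flags older symbols such as the card suits, which you may or may not want.

```php
<?php
// Hypothetical helper: name and block selection are illustrative only.
function containsEmojiLike(string $value): bool
{
    // Unicode block ranges; a partial list, not the complete emoji set.
    $ranges = [
        '\x{2600}-\x{26FF}',   // Miscellaneous Symbols
        '\x{2700}-\x{27BF}',   // Dingbats
        '\x{1F300}-\x{1F5FF}', // Miscellaneous Symbols and Pictographs
        '\x{1F600}-\x{1F64F}', // Emoticons
        '\x{1F680}-\x{1F6FF}', // Transport and Map Symbols
    ];

    // Join the ranges into one character class and match per character (/u).
    $pattern = '/[' . implode('', $ranges) . ']/u';

    // preg_match() returns false for invalid UTF-8 input; that is treated
    // as "no emoji found" here, which may not be what you want.
    return preg_match($pattern, $value) === 1;
}

var_dump(containsEmojiLike("hello"));                  // plain ASCII
var_dump(containsEmojiLike("hi \xF0\x9F\x8C\x8F"));    // contains U+1F30F
var_dump(containsEmojiLike("spade \xE2\x99\xA0"));     // contains U+2660
```

Keeping the ranges in an array with a comment per block makes it easier to add or drop blocks as your definition of ‘emoji’ changes.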

Sorry I don't have a full answer for you, but this should get you headed in the right direction.

Answer 3:

The right answer is to detect where you have an assigned code point in the Miscellaneous_Symbols_And_Pictographs block. In Perl, you’d use

/\p{Assigned}/ && /\p{block=Miscellaneous_Symbols_And_Pictographs}/

or just

/\P{Cn}/ && /\p{Miscellaneous_Symbols_And_Pictographs}/

which you should combine into one pattern with

/(?=\p{Assigned})\p{Miscellaneous_Symbols_And_Pictographs}/

I don’t recall whether the PCRE library that PHP uses gives you access to the requisite Unicode character properties. My recollection is that it’s pretty weak in that particular area. I think you only have Unicode script properties and general categories. Sigh.

Sometimes you just have to use the real thing.

For lack of decent Unicode support, you may have to enumerate the block yourself:

/(?=\P{Cn})[\x{1F300}-\x{1F5FF}]/u

Looks like a maintenance nightmare to me, full of magic numbers.
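
One way to tame the magic numbers is to give the block boundaries names and build the pattern from them. This is a rough sketch, assuming PHP 7 or later and a PCRE build with Unicode property support. The constant and function names are made up for this example.

```php
<?php
// Hypothetical names for the boundaries of the
// Miscellaneous Symbols and Pictographs block.
const PICTOGRAPHS_FIRST = 0x1F300;
const PICTOGRAPHS_LAST  = 0x1F5FF;

// Hypothetical helper: true if $value contains an assigned code point
// from the block above.
function inPictographBlock(string $value): bool
{
    // Builds /(?=\P{Cn})[\x{1F300}-\x{1F5FF}]/u from the named constants.
    $pattern = sprintf(
        '/(?=\P{Cn})[\x{%X}-\x{%X}]/u',
        PICTOGRAPHS_FIRST,
        PICTOGRAPHS_LAST
    );

    return preg_match($pattern, $value) === 1;
}

var_dump(inPictographBlock("\xF0\x9F\x8C\x8F")); // U+1F30F, inside the block
var_dump(inPictographBlock("plain text"));       // no match expected
```

Whether `\P{Cn}` treats a given code point as assigned depends on the Unicode version your PCRE library was built against.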

Answer 4:

That's what I came up with today. It's probably not a good solution for this problem, but at least it works ;)

if(iconv('Windows-1250', 'UTF-8', iconv('UTF-8', 'Windows-1250', $value)) != $value)

php find emoji [update existing code]
