Testing for Japanese/Chinese Characters in a string

I have a program that reads a bunch of text and analyzes it. The text may be in any language, but I need to test for Japanese and Chinese specifically so that I can analyze them differently.

I have read that I can test each character's Unicode code point to find out whether it falls in the range of CJK characters. This is helpful; however, I would like to separate the two if possible so I can process the text against different dictionaries. Is there a way to test whether a character is Japanese OR Chinese?

Tags: string unicode language-detection

6 Answers

兄弟一词,经得起流年.

#2 · 2019-01-25 15:55

The process of developing Unicode included the Han Unification. This is because a lot of the Japanese characters are derived from, or the same as, Chinese characters; similarly with Korean. There are some characters (katakana and hiragana - see chapter 12 of the Unicode standard v5.1.0) commonly used in Japanese that would indicate that the text was Japanese rather than Chinese, but I believe it would be a statistical test rather than definitive.

Check out the O'Reilly book on CJKV Information Processing (CJKV is short for Chinese, Japanese, Korean, Vietnamese; I have the CJK predecessor lurking somewhere). There's also the O'Reilly book on Unicode Explained which may be some help, though probably not for this question (I don't recall a discussion of how to identify Japanese and Chinese text).

姐就是有狂的资本

#3 · 2019-01-25 15:56

There are many characters which are only (commonly) used in Japanese or only used in Chinese.

Japan and China both simplified many characters but often in different ways. You can check for Japanese Shinjitai and Simplified Chinese characters. There are many more of the latter than the former. If there are none of either then you probably have Traditional Chinese.

Of course, if you're dealing with Unicode text, you may find occasional rare characters or mixed languages that could throw off a heuristic, so you're better off counting the types of characters to make a judgement.

A good way to find out which characters are common in one language and not in the others is to compare the legacy encodings against each other. You can find mappings of each to Unicode easily on the internet.
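As a rough sketch of that idea, Python's built-in codecs can stand in for the mapping tables: a character that encodes in Shift_JIS but not in GB2312 is a candidate "Japanese-only" form, and vice versa. The helper names `encodable` and `count_exclusive` are made up for this illustration, and the choice of `shift_jis` and `gb2312` as the two encodings to compare is an assumption; Big5 could be added the same way for Traditional Chinese.

```python
def encodable(ch: str, codec: str) -> bool:
    """Return True if `ch` can be represented in the given legacy codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def count_exclusive(text: str) -> dict:
    """Count Han characters found in only one of the two legacy encodings."""
    counts = {"jis_only": 0, "gb_only": 0}
    for ch in text:
        # Only look at the main CJK Unified Ideographs block, so that kana,
        # punctuation and Latin letters do not affect the tally.
        if not "\u4e00" <= ch <= "\u9fff":
            continue
        in_jis = encodable(ch, "shift_jis")
        in_gb = encodable(ch, "gb2312")
        if in_jis and not in_gb:
            counts["jis_only"] += 1
        elif in_gb and not in_jis:
            counts["gb_only"] += 1
    return counts
```

A text with a high `jis_only` count leans Japanese, one with a high `gb_only` count leans Simplified Chinese, and characters present in both encodings contribute nothing either way.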

I used to have some code I wrote which did a binary search by codepoint and it was extremely fast even in JavaScript - I may have lost it in my travels though (-:
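The original code is not available, so the following is only a guess at what a binary search by code point might look like, written in Python rather than JavaScript. It assumes a small sorted table of Unicode block ranges; the table contents and the name `script_of` are illustrative, and a real table would likely be longer.

```python
from bisect import bisect_right

# Sorted, non-overlapping (start, end, label) ranges based on Unicode blocks.
RANGES = [
    (0x3040, 0x309F, "hiragana"),
    (0x30A0, 0x30FF, "katakana"),
    (0x3400, 0x4DBF, "han"),     # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF, "han"),     # CJK Unified Ideographs
    (0xAC00, 0xD7AF, "hangul"),  # Hangul Syllables
]
STARTS = [start for start, _, _ in RANGES]


def script_of(ch: str) -> str:
    """Find the range containing `ch` with a binary search on range starts."""
    cp = ord(ch)
    # Index of the last range whose start is <= cp, or -1 if there is none.
    i = bisect_right(STARTS, cp) - 1
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return "other"
```

Each lookup costs O(log n) comparisons in the number of ranges, which is why this approach stays fast even with a large table.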

你好瞎i

#4 · 2019-01-25 15:58

Testing for characters in the katakana or hiragana ranges should be a very reliable means of determining whether or not the text is Japanese, especially if you are dealing with 'regular' user-generated text. If you are looking at legal documents or other more official fare, it might be slightly more difficult, as there will be a much greater preponderance of complex Chinese characters, but it should still be pretty reliable.
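A minimal sketch of such a kana test, assuming the Hiragana (U+3040 to U+309F) and Katakana (U+30A0 to U+30FF) blocks are the ranges meant here. The function names and the 5% threshold are arbitrary choices for illustration, not values taken from this answer.

```python
import re

# Hiragana (U+3040-U+309F) and Katakana (U+30A0-U+30FF) blocks.
KANA = re.compile(r"[\u3040-\u30ff]")


def kana_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the kana blocks."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if KANA.match(c)) / len(chars)


def looks_japanese(text: str, threshold: float = 0.05) -> bool:
    """Guess Japanese when kana make up at least `threshold` of the text."""
    return kana_ratio(text) >= threshold
```

Using a ratio rather than a single hit makes the check a little more tolerant of kanji-heavy formal text that still contains some kana, and of Chinese text that quotes a stray kana character.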

手持菜刀，她持情操

#5 · 2019-01-25 16:01

You probably can't do that reliably. Japanese uses a lot of the same characters as Chinese. I think the best you could do is to look at a block of text. If you see any uniquely Japanese characters, then you can assume the whole block is Japanese. If not, then it's probably Chinese.

However, I'm just learning Chinese, so I'm not an expert.

太酷不给撩

#6 · 2019-01-25 16:02

You won't be able to test a single character and tell with certainty whether it is Japanese or Chinese, because of the way the Unihan code points are implemented in the Unicode standard. Basically, every Chinese character is a potential Japanese character. However, the reverse is not true. Also, there are a number of conventions that can be used to test whether a block of text is in one language or the other.

Simplifications - if the character you are testing is a PRC simplification such as 门, it is used only in mainland Chinese.
Kana - if the character is one of the many Japanese kana characters such as あいうえお, then the text block you are working with is definitely Japanese.

The problem arises with the sheer number of characters and words that are in common. However, if I needed a quick and dirty solution to this problem, I would check my entire blocks of text for kana - if the text contains kana then I know it is Japanese. If you need to distinguish Korean as well, I would test for Hangul. Also, if you need to distinguish what type of Chinese, testing for types of simplifications would be the best approach.
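The quick and dirty check described above might look roughly like this. It is only a sketch: the name `classify_block`, the returned labels, and the exact code point ranges (hiragana and katakana letters, precomposed Hangul syllables, and the main CJK Unified Ideographs block) are assumptions, and it does not attempt the simplified versus traditional distinction.

```python
def classify_block(text: str) -> str:
    """Label a block of text by the first kana or Hangul character found."""
    has_han = False
    for ch in text:
        cp = ord(ch)
        # Hiragana letters (U+3041-U+3096) or katakana letters (U+30A1-U+30FA).
        if 0x3041 <= cp <= 0x3096 or 0x30A1 <= cp <= 0x30FA:
            return "japanese"
        # Precomposed Hangul syllables (U+AC00-U+D7A3).
        if 0xAC00 <= cp <= 0xD7A3:
            return "korean"
        # Main CJK Unified Ideographs block (U+4E00-U+9FFF).
        if 0x4E00 <= cp <= 0x9FFF:
            has_han = True
    # Han characters with no kana or Hangul: assume Chinese.
    return "chinese" if has_han else "unknown"
```

Because the first kana or Hangul character decides the result, mixed-language blocks will be labelled by whichever script appears first, which is the main weakness of this shortcut.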

做个烂人

#7 · 2019-01-25 16:03

A workaround is to check the encoding before it is converted to Unicode.
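One possible reading of this workaround, sketched under the assumption that the source (an HTTP header, a file's metadata, and so on) declares a charset label before decoding. The mapping table and the function name `language_from_charset` are hypothetical; `codecs.lookup` is used only to normalize label spellings such as "Shift_JIS" or "EUC-JP".

```python
import codecs

# Hypothetical mapping from Python codec names to a language guess.
CODEC_LANGUAGE = {
    "shift_jis": "japanese",
    "euc_jp": "japanese",
    "iso2022_jp": "japanese",
    "gb2312": "chinese (simplified)",
    "gbk": "chinese (simplified)",
    "gb18030": "chinese (simplified)",
    "big5": "chinese (traditional)",
    "euc_kr": "korean",
}


def language_from_charset(label: str):
    """Return a language guess for a declared charset, or None if unknown."""
    try:
        # Normalize aliases and case to Python's canonical codec name.
        name = codecs.lookup(label).name
    except LookupError:
        return None
    return CODEC_LANGUAGE.get(name)
```

This gives no hint when the text arrives already encoded as UTF-8 or UTF-16, so it can only complement the character-based checks.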
