How can I detect certain Unicode characters in a s

2019-01-13 14:05发布

Given a string in Ruby 1.8.7 (without the awesome Oniguruma regular expression engine that supports Unicode properties with \p{}), I would like to be able to determine if the string contains one or more Chinese, Japanese, or Korean characters; i.e.

class String
  def contains_cjk?
    ...
  end
end

>> '日本語'.contains_cjk?
=> true
>> '광고 프로그램'.contains_cjk?
=> true
>> '艾弗森将退出篮坛'.contains_cjk?
=> true
>> 'Watashi ha bakana gaijin desu.'.contains_cjk?
=> false

I suspect that this will boil down to seeing if any of the characters in the string are in the Unihan CJKV Unicode blocks, but I figured it was worth asking if anyone knows of an existing solution in Ruby.

4条回答
干净又极端
2楼-- · 2019-01-13 14:28

I've written a little gem that packages up the approach in steenslag's answer above:

https://github.com/jpatokal/script_detector

It can also take a stab at differentiating between Japanese, Korean, simplified Chinese and traditional Chinese, although due to the complexities of Han unification it only works reliably with large slabs of text.

查看更多
ら.Afraid
3楼-- · 2019-01-13 14:30

Given my Ruby 1.8.7 constraint, this is the best I could do:

class String
  CJKV_RANGES = [
      (0xe2ba80..0xe2bbbf),
      (0xe2bfb0..0xe2bfbf),
      (0xe38080..0xe380bf),
      (0xe38180..0xe383bf),
      (0xe38480..0xe386bf),
      (0xe38780..0xe387bf),
      (0xe38880..0xe38bbf),
      (0xe38c80..0xe38fbf),
      (0xe39080..0xe4b6bf),
      (0xe4b780..0xe4b7bf),
      (0xe4b880..0xe9bfbf),
      (0xea8080..0xea98bf),
      (0xeaa080..0xeaaebf),
      (0xeaaf80..0xefbfbf),
  ]

  def contains_cjkv?
    each_char do |ch|
      return true if CJKV_RANGES.any? {|range| range.member? ch.unpack('H*').first.hex }
    end
    false
  end
end


strings = ['日本', '광고 프로그램', '艾弗森将退出篮坛', 'Watashi ha bakana gaijin desu.']
strings.each {|s| puts s.contains_cjkv? }

#true
#true
#true
#false

Pretty hacktacular, but it works. It actually detects a variety of Indic scripts as well, so it should probably really be called contains_asian?

Maybe I should gem this up for other poor I18N hackers stuck with Ruby 1.8.

查看更多
狗以群分
4楼-- · 2019-01-13 14:46

(ruby 1.9.2)

#encoding: UTF-8
class String
  def contains_cjk?
    !!(self =~ /\p{Han}|\p{Katakana}|\p{Hiragana}|\p{Hangul}/)
  end
end

strings= ['日本', '광고 프로그램', '艾弗森将退出篮坛', 'Watashi ha bakana gaijin desu.']
strings.each{|s| puts s.contains_cjk?}

#true
#true
#true
#false

\p{} matches a character’s Unicode script.
The following scripts are supported: Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, and Yi.

Wow. Ruby Regexp source .

查看更多
Evening l夕情丶
5楼-- · 2019-01-13 14:48

Ruby 1.8 solution based on this code and using the API from Josh Glover's solution on this thread:

class String
  CJKV_RANGES = [
    (0x4E00..0x9FFF),
    (0x3400..0x4DBF),
    (0x20000..0x2A6DF),
    (0x2A700..0x2B73F),
  ]

  def contains_cjkv?
    unpack("U*").any? { |char|
      CJKV_RANGES.any? { |range| range.member?(char) }
    }
  end
end
查看更多
登录 后发表回答