
Chinese language in Cassandra

Published 2020-02-15 08:49

Question:

I stored Chinese characters in Cassandra, and the data appears to be entered correctly, as shown below:

SELECT * FROM user;

 user_id | user_name    | user_phone
---------+--------------+-------------
      23 |      uSer23, | 12345678910
       5 |       uSer5^ | 12345678910
      28 |     uSer28名 | 12345678910
      10 |      uSer10- | 12345678910
      16 |      uSer16{ | 12345678910
      13 |      uSer13= | 12345678910
      30 |   uSer30一些 | 12345678910
      11 |      uSer11_ | 12345678910
       1 |       uSer1@ | 12345678910
      19 |      uSer19" | 12345678910
       8 |       uSer8( | 12345678910
       0 |       uSer0! | 12345678910
       2 |       uSer2# | 12345678910
       4 |       uSer4% | 12345678910
      18 |      uSer18[ | 12345678910
      15 |      uSer15} | 12345678910
      22 |      uSer22< | 12345678910
      27 |      uSer27/ | 12345678910
      20 |      uSer20: | 12345678910
       7 |       uSer7* | 12345678910
       6 |       uSer6& | 12345678910
      29 |     uSer29称 | 12345678910
       9 |       uSer9) | 12345678910
      14 |      uSer14| | 12345678910
      26 |      uSer26? | 12345678910
      21 |      uSer21; | 12345678910
      17 |      uSer17] | 12345678910
      31 | uSer31区中文 | 12345678910
      24 |      uSer24> | 12345678910
      25 |      uSer25. | 12345678910
      12 |      uSer12+ | 12345678910
       3 |       uSer3$ | 12345678910

I created an index on the 'user_name' field as follows:

CREATE CUSTOM INDEX user_nontoken_idx ON QCS.user (user_name) 
  USING 'org.apache.cassandra.index.sasi.SASIIndex' 
  WITH OPTIONS = {'mode': 'CONTAINS', 'analyzer_class': 
    'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
    'case_sensitive': 'false'}; 

When I search using those Chinese characters, the search succeeds.

SELECT * FROM user WHERE user_name LIKE '%称%';

How does this actually work? How is Cassandra able to store Chinese text?

Answer 1:

By default, text in Cassandra is stored as UTF-8, as was mentioned in a comment.

To answer your question: the main work is done by SASI, which takes the data from the text column and applies the analyzer to it. In most cases the analyzer treats Chinese characters like any other characters. If you plan to index longer text columns, you may want to look at StandardAnalyzer, but for user names or similar values, NonTokenizingAnalyzer may be a better fit.
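
As a rough illustration of why Chinese characters need no special handling in a case-insensitive CONTAINS search, here is a minimal Python sketch. It is a conceptual model only, not how SASI is implemented; the helper `contains_match` and the sample list are hypothetical, with the names taken from the table in the question.

```python
# Conceptual sketch only: this does NOT reproduce SASI's internals.
# It models a non-tokenizing, case-insensitive "contains" match on Unicode strings.

def contains_match(value: str, pattern: str, case_sensitive: bool = False) -> bool:
    """Return True if pattern occurs anywhere in value (hypothetical helper)."""
    if not case_sensitive:
        # Lowercasing affects Latin letters; Chinese characters have no case
        # and pass through unchanged.
        value = value.lower()
        pattern = pattern.lower()
    return pattern in value

# Sample values copied from the question's table.
user_names = ["uSer28名", "uSer29称", "uSer30一些", "uSer31区中文", "uSer3$"]

# Analogous to: SELECT * FROM user WHERE user_name LIKE '%称%';
matches = [name for name in user_names if contains_match(name, "称")]
print(matches)

# Case-insensitive match on the Latin part of the value.
ascii_matches = [name for name in user_names if contains_match(name, "user3")]
print(ascii_matches)
```

The whole value is treated as a single term, so a Chinese character is matched the same way as any other character in the string.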



Answer 2:

Cassandra can handle language-specific strings because the "TEXT" data type (the type of the "user_name" column here) is a

"UTF-8 encoded string"

in Cassandra. By contrast, if the "user_name" column had been declared as "ascii", it would accept only US-ASCII character strings.
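
The difference between the two encodings can be seen outside Cassandra as well. The following Python sketch uses only the standard string codecs and one of the user names from the question; it illustrates the encodings themselves, not Cassandra's storage code.

```python
# Illustration of UTF-8 vs. US-ASCII encoding using Python's built-in codecs.
name = "uSer29称"

# UTF-8 can represent every character: ASCII characters take 1 byte each,
# while this Chinese character takes 3 bytes.
utf8_bytes = name.encode("utf-8")
print(len(name))        # number of characters
print(len(utf8_bytes))  # number of bytes

# Decoding the bytes returns the original string unchanged.
decoded = utf8_bytes.decode("utf-8")

# US-ASCII cannot represent the Chinese character, so encoding fails.
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```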