Why do I need a tokenizer for each language? [closed]

Asked 2019-04-20 04:33

When processing text, why would one need a tokenizer specialized for the language?

Wouldn't tokenizing by whitespace be enough? In which cases is simple whitespace tokenization not a good idea?

3 Answers

Answer 1 · 闹够了就滚 · 2019-04-20 04:55

The question also implies "What is a word?", and the answer can be quite task-specific (even disregarding multilinguality as one parameter). Here's my attempt at a comprehensive answer:

(Missing) Spaces between words

Many languages do not put spaces in between words at all, and so the basic word division algorithm of breaking on whitespace is of no use at all. Such languages include major East-Asian languages/scripts, such as Chinese, Japanese, and Thai. Ancient Greek was also written by Ancient Greeks without word spaces. Spaces were introduced (together with accent marks, etc.) by those who came afterwards. In such languages, word segmentation is a much more major and challenging task. (MANNI:1999, p. 129)
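A minimal sketch of the contrast in Python (the tiny lexicon and the greedy longest-match strategy are purely illustrative; real segmenters are statistical and use much larger dictionaries):

```python
# Whitespace tokenization vs. a toy greedy longest-match segmenter.
# The lexicon below is made up for this example, not a real resource.

def whitespace_tokenize(text):
    return text.split()

def greedy_segment(text, lexicon, max_len=4):
    """Greedy longest-match segmentation (real segmenters are statistical)."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens

lexicon = {"我", "喜欢", "自然", "语言", "处理"}  # "I", "like", "natural", "language", "processing"
sentence = "我喜欢自然语言处理"

print(whitespace_tokenize(sentence))      # ['我喜欢自然语言处理']  -- one useless "token"
print(greedy_segment(sentence, lexicon))  # ['我', '喜欢', '自然', '语言', '处理']
```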

Compounds

German compound nouns are written as a single word, e.g. "Kartellaufsichtsbehördenangestellter" (an employee at the "Anti-Trust agency"), and compounds de facto are single words -- phonologically (cf. (MANNI:1999, p. 120)). Their information density, however, is high, and one may wish to divide such a compound, or at least to be aware of the word's internal structure, which becomes a limited word-segmentation task. (Ibidem)

There is also the special case of agglutinative languages, in which prepositions, possessive pronouns, etc. are 'attached' to the 'main' word; within Europe, Finnish, Hungarian, and Turkish are examples.
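A toy decompounding sketch for the German example above (the lexicon and the handling of linking elements are greatly simplified and made up for this example; real decompounders rely on corpus statistics and proper morphology):

```python
# Recursive dictionary-based decompounding with a tiny hand-made lexicon.
# Real systems use corpus frequencies and handle linking elements ("Fugen-s") properly.

LEXICON = {"kartell", "aufsicht", "behörde", "behörden", "angestellter"}
LINKERS = ("s", "")  # linking elements to try between parts (simplified)

def decompound(word, lexicon=LEXICON):
    word = word.lower()
    if word in lexicon:
        return [word]
    for i in range(len(word) - 1, 2, -1):
        head = word[:i]
        for linker in LINKERS:
            stem = head[:len(head) - len(linker)]
            if head.endswith(linker) and stem in lexicon:
                rest = decompound(word[i:], lexicon)
                if rest:
                    return [stem] + rest
    return []  # no analysis found

print(decompound("Kartellaufsichtsbehördenangestellter"))
# ['kartell', 'aufsicht', 'behörden', 'angestellter']  (one possible analysis)
```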

Variant styles and codings

Variant coding of information of a certain semantic type, e.g. local syntax for phone numbers, dates, etc.:

[...] Even if one is not dealing with multilingual text, any application dealing with text from different countries or written according to different stylistic conventions has to be prepared to deal with typographical differences. In particular, some items such as phone numbers are clearly of one semantic sort, but can appear in many formats. (MANNI:1999, p. 130)
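A hedged sketch of what that means for a tokenizer (Python; the phone and date patterns are invented for this example and cover only a handful of formats):

```python
import re

# A regex tokenizer that keeps "semantic units" such as phone numbers and
# dates in one piece instead of splitting them at punctuation.
TOKEN_RE = re.compile(r"""
      \+?\d[\d\s().-]{6,}\d              # phone numbers: +1 (555) 123-4567, 0171-123456, ...
    | \d{1,4}[./-]\d{1,2}[./-]\d{1,4}    # dates: 2019-04-20, 20.04.2019, 4/20/19
    | \w+(?:[-']\w+)*                    # ordinary words, incl. internal hyphens/apostrophes
    | [^\w\s]                            # any other single non-space symbol
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Call +1 (555) 123-4567 before 20.04.2019, please."))
# ['Call', '+1 (555) 123-4567', 'before', '20.04.2019', ',', 'please', '.']
```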

Misc.

One major task is the disambiguation of periods (or interpunctuation in general) and other non-alphanumeric symbols: if, e.g., a period is part of the word, keep it that way, so we can distinguish Wash., an abbreviation for the state of Washington, from the capitalized form of the verb wash (MANNI:1999, p. 129). Besides cases like this, handling contractions and hyphenation also cannot be treated as a standard case across languages (even disregarding the missing whitespace separator).
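A minimal sketch of the period decision (the abbreviation list is hand-made for this example; real tokenizers, e.g. the Punkt algorithm cited further down, learn such lists from data and use context):

```python
# Decide whether a trailing period belongs to the token (abbreviation) or is
# sentence punctuation. A static list cannot resolve genuinely ambiguous cases,
# such as the capitalized verb "Wash." at the end of a sentence.
ABBREVIATIONS = {"Wash.", "Mr.", "Dr.", "etc.", "e.g.", "i.e."}

def split_trailing_period(token):
    if token in ABBREVIATIONS:
        return [token]               # keep the period: it is part of the abbreviation
    if token.endswith(".") and len(token) > 1:
        return [token[:-1], "."]     # detach the period: it ends the sentence
    return [token]

print(split_trailing_period("Wash."))  # ['Wash.']
print(split_trailing_period("wash."))  # ['wash', '.']
```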

If one wants to handle multilingual contractions/clitics (a small splitting sketch follows the examples):

  • English: They're my father's cousins.
  • French: Montrez-le à l'agent!
  • German: Ich hab's ins Haus gebracht. (in's is still a valid spelling variant)
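A rough splitting sketch for the examples above (the two rules are illustrative only; real clitic handling needs language-specific resources, and hyphenated clitics such as French "-le" are left untouched here):

```python
import re

# Split a few English/French clitic patterns off their host words.
CLITIC_RULES = [
    (re.compile(r"(?i)\b(\w+)('re|'ve|'ll|'d|n't|'s)\b"), r"\1 \2"),  # English/German 's etc.
    (re.compile(r"(?i)\b([ldjmnst])'(?=\w)"), r"\1' "),               # French l', d', j', ...
]

def split_clitics(text):
    text = text.replace("\u2018", "'").replace("\u2019", "'")  # normalize curly apostrophes
    for pattern, replacement in CLITIC_RULES:
        text = pattern.sub(replacement, text)
    return text.split()

print(split_clitics("They're my father's cousins."))
# ['They', "'re", 'my', 'father', "'s", 'cousins.']
print(split_clitics("Montrez-le à l'agent!"))
# ['Montrez-le', 'à', "l'", 'agent!']
print(split_clitics("Ich hab's ins Haus gebracht."))
# ['Ich', 'hab', "'s", 'ins', 'Haus', 'gebracht.']
```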

Since tokenization and sentence segmentation go hand in hand, they share the same (cross-language) problems. For anyone who wants a general direction (a brief usage sketch follows the list):

  • Kiss, Tibor and Jan Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics 32(4), pp. 485-525.
  • Palmer, D. and M. Hearst. 1997. Adaptive multilingual sentence boundary disambiguation. Computational Linguistics 23(2), pp. 241-267.
  • Reynar, J. and A. Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. Proceedings of the Fifth Conference on Applied Natural Language Processing, pp. 16-19.
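As a practical pointer, NLTK's default sentence splitter implements the Punkt approach from the Kiss & Strunk paper above; a minimal usage sketch, assuming nltk is installed and its Punkt model data has been downloaded:

```python
# pip install nltk, then e.g. nltk.download('punkt') once for the model data.
import nltk

text = "Mr. Smith moved to Wash. in 1999. He works at Acme Inc. now."
print(nltk.sent_tokenize(text))
# Ideally two sentences; whether "Wash." and "Inc." end a sentence is exactly
# the abbreviation/period ambiguity discussed above.
```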

References

(MANNI:1999) Manning, Ch. D. and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press.

Answer 2 · 迷人小祖宗 · 2019-04-20 05:10

Tokenization is the identification of linguistically meaningful units (LMUs) in the surface text.

Chinese: 如果您在新加坡只能前往一间夜间娱乐场所,Zouk必然是您的不二之选。

English: If you only have time for one club in Singapore, then it simply has to be Zouk.

Indonesian: Jika Anda hanya memiliki waktu untuk satu klub di Singapura, pergilah ke Zouk.

Japanese: シンガポールで一つしかクラブに行く時間がなかったとしたら、このズークに行くべきです。

Korean: 싱가포르에서 클럽 한 군데밖에 갈시간이 없다면, Zouk를 선택하세요.

Vietnamese: Nếu bạn chỉ có thời gian ghé thăm một câu lạc bộ ở Singapore thì hãy đến Zouk.

Text Source: http://aclweb.org/anthology/Y/Y11/Y11-1038.pdf

The tokenized version of the parallel text above should look like this:

[Image: the tokenized versions of the parallel sentences]

For English it's simple, because each LMU is delimited by whitespace. In other languages, however, this might not be the case. Most romanized languages, such as Indonesian, use the same whitespace delimiter, so LMUs are easy to identify.

However, sometimes an LMU is a combination of two "words" separated by spaces. E.g., in the Vietnamese sentence above, you have to read thời_gian ("time" in English) as one token, not two. Separating the two words into two tokens yields no LMU (e.g. http://vdict.com/th%E1%BB%9Di,2,0,0.html) or the wrong LMUs (e.g. http://vdict.com/gian,2,0,0.html). Hence a proper Vietnamese tokenizer would output thời_gian as a single token rather than thời and gian (see the sketch below).
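A toy sketch of that joining step (the two-entry phrase table is made up for this example; real Vietnamese tokenizers use large lexicons and statistical models):

```python
# Join multi-syllable Vietnamese words after an initial whitespace split.
MULTI_WORD = {("thời", "gian"): "thời_gian", ("câu", "lạc", "bộ"): "câu_lạc_bộ"}

def join_multiword(tokens, table=MULTI_WORD, max_len=3):
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 1, -1):          # try the longest phrase first
            if tuple(tokens[i:i + n]) in table:
                out.append(table[tuple(tokens[i:i + n])])
                i += n
                break
        else:                                    # no phrase matched: keep the single token
            out.append(tokens[i])
            i += 1
    return out

words = "Nếu bạn chỉ có thời gian ghé thăm một câu lạc bộ ở Singapore".split()
print(join_multiword(words))
# ['Nếu', 'bạn', 'chỉ', 'có', 'thời_gian', 'ghé', 'thăm', 'một', 'câu_lạc_bộ', 'ở', 'Singapore']
```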

Some other languages' orthographies have no spaces at all to delimit "words" or "tokens", e.g. Chinese, Japanese, and sometimes Korean. In those cases, tokenization is necessary for a computer to identify LMUs. Often there are morphemes/inflections attached to an LMU, so sometimes a morphological analyzer is more useful than a plain tokenizer in natural language processing.
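For illustration, a sketch with the jieba segmenter on the Chinese sentence above (assuming the jieba package is installed; the exact segmentation may vary with version and dictionary):

```python
import jieba  # pip install jieba

sentence = "如果您在新加坡只能前往一间夜间娱乐场所,Zouk必然是您的不二之选。"

print(sentence.split())      # whitespace: the whole clause comes back as a single "token"
print(jieba.lcut(sentence))  # e.g. ['如果', '您', '在', '新加坡', '只能', '前往', ...]
```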

Answer 3 · 倾城 Initia · 2019-04-20 05:17

Some languages, like Chinese, don't use whitespace to separate words at all.

Other languages will use punctuation differently - an apostrophe might or might not be a part of a word, for instance.

Case-folding rules vary from language to language.
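For example, Turkish has dotted and dotless i, which Python's locale-unaware str.lower() gets wrong; a minimal hand-rolled sketch for just that pair:

```python
# In Turkish, "I" lowercases to dotless "ı" and "İ" to "i"; str.lower() does neither.
def turkish_lower(text):
    return text.replace("I", "ı").replace("İ", "i").lower()

print("KIRMIZI".lower())         # 'kirmizi'  -- wrong for Turkish
print(turkish_lower("KIRMIZI"))  # 'kırmızı'  -- the Turkish word for "red"
```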

Stopwords and stemming are different between languages (though I guess I'm straying from tokenizer to analyzer here).

Edit by Bjerva: Additionally, many languages concatenate compound nouns. Whether such a compound should be split into several tokens cannot be easily determined using whitespace alone.
