When processing text, why would one need a tokenizer specialized for the language?
Wouldn't tokenizing by whitespace be enough? What are the cases where it is not a good idea to simply use whitespace tokenization?
The question also implies "What is a word?", and the answer can be quite task-specific (even disregarding multilinguality as one parameter). Here's my attempt at a subsuming answer:
(Missing) Spaces between words
Compounds
There is also the special case of agglutinative languages, where prepositions, possessive pronouns, etc. are 'attached' to the 'main' word; e.g. Finnish, Hungarian and Turkish among European languages.
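To make this concrete, here is a minimal Python sketch; the suffix list and gloss are hand-picked assumptions for illustration, not a real morphological analyzer. It shows that a single whitespace token in Turkish can pack several grammatical words:

```python
# Toy illustration only: the suffix list and gloss are hand-picked assumptions,
# not a real morphological analyzer.
SUFFIXES = ["de", "im", "ler"]  # locative "in", possessive "my", plural marker

def strip_suffixes(word):
    """Greedily peel known suffixes off the end of a word."""
    morphemes = []
    while True:
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix):
                morphemes.insert(0, suffix)
                word = word[: -len(suffix)]
                break
        else:  # no suffix matched any more: stop
            break
    return [word] + morphemes

print("evlerimde".split())          # ['evlerimde']             -> one whitespace token
print(strip_suffixes("evlerimde"))  # ['ev', 'ler', 'im', 'de'] ~ "in my houses"
```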
Variant styles and codings
Variant coding of information of a certain semantic type, e.g. local syntax for phone numbers, dates, etc.
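As a rough sketch of how such patterns can be handled, the regular expressions below, which are illustrative assumptions rather than a complete inventory of real-world formats, treat dates and phone numbers as single tokens before falling back to whitespace:

```python
import re

# Illustrative patterns only, not a complete inventory of real-world formats.
TOKEN_PATTERN = re.compile(r"""
    \d{1,2}\.\d{1,2}\.\d{2,4}      # German-style date, e.g. 24.12.1999
  | \d{4}-\d{2}-\d{2}              # ISO date, e.g. 1999-12-24
  | \+?\d[\d\s/().-]{5,}\d         # rough phone-number shape, e.g. +49 (89) 123-4567
  | \S+                            # fallback: plain whitespace-delimited chunk
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize("Call +49 (89) 123-4567 before 24.12.1999 or 1999-12-24."))
# ['Call', '+49 (89) 123-4567', 'before', '24.12.1999', 'or', '1999-12-24', '.']
```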
Misc.
One major task is the disambiguation of periods (or interpunctuation in general) and other non-alpha(-numeric) symbols: if e.g. a period is part of the word, keep it that way, so that Wash., an abbreviation for the state of Washington, can be distinguished from the capitalized form of the verb wash (MANNI:1999, p.129). Besides cases like this, handling contractions and hyphenation also cannot be treated as a cross-language standard case (even disregarding the missing whitespace separator).
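A minimal sketch of this disambiguation, assuming a small hand-made abbreviation list (real tokenizers look such lists up or learn them from data):

```python
import re

# The abbreviation list is a hand-made assumption for illustration.
ABBREVIATIONS = {"Wash.", "Dr.", "etc.", "e.g.", "i.e."}

def tokenize(text):
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)  # keep the period: "Wash." stays one token
        else:
            # otherwise split trailing/leading punctuation off the word
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

print(tokenize("He moved to Wash. last year."))
# ['He', 'moved', 'to', 'Wash.', 'last', 'year', '.']
print(tokenize("Please wash the car."))
# ['Please', 'wash', 'the', 'car', '.']
```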
Multilingual contractions/clitics also need to be handled.
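A rough, treebank-style sketch of such splitting; the rules and the split_clitics helper are illustrative assumptions, not an exhaustive treatment:

```python
import re

# Illustrative clitic/contraction rules in the spirit of common treebank conventions.
def split_clitics(token, lang):
    if lang == "en":
        m = re.fullmatch(r"(.+)(n't|'s|'re|'ve|'ll|'d|'m)", token)
        if m:
            return [m.group(1), m.group(2)]      # "don't" -> ["do", "n't"]
    if lang == "fr":
        m = re.fullmatch(r"([cdjlmnst]'|qu')(.+)", token, re.IGNORECASE)
        if m:
            return [m.group(1), m.group(2)]      # "l'homme" -> ["l'", "homme"]
    return [token]

print(split_clitics("don't", "en"))    # ['do', "n't"]
print(split_clitics("l'homme", "fr"))  # ["l'", 'homme']
```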
Since tokenization and sentence segmentation go hand in hand, they share the same (cross-language) problems.
References
(MANNI:1999) Manning, C. D., and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press.
Tokenization is the identification of linguistically meaningful units (LMU) from the surface text.
The tokenized version of the parallel text above should look like this:
For English, it's simple because each LMU is delimited/separated by whitespace. However, in other languages this might not be the case. Most romanized languages, such as Indonesian, use the same whitespace delimiter, which makes it easy to identify an LMU.
However, sometimes an LMU is a combination of two "words" separated by a space. E.g. in the Vietnamese sentence above, you have to read thời_gian (it means time in English) as one token and not as 2 tokens. Separating it into 2 tokens yields no LMU (e.g. http://vdict.com/th%E1%BB%9Di,2,0,0.html) or the wrong LMU(s) (e.g. http://vdict.com/gian,2,0,0.html). Hence a proper Vietnamese tokenizer would output thời_gian as one token rather than thời and gian.
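A minimal sketch of the difference, where MULTI_SYLLABLE_WORDS is a hypothetical mini-lexicon standing in for a real Vietnamese word list:

```python
# "thời gian" (time) is one Vietnamese word spelled with a space between syllables;
# the tiny word list below is an assumption for illustration only.
MULTI_SYLLABLE_WORDS = {("thời", "gian")}

def tokenize_vi(sentence):
    syllables = sentence.split()           # plain whitespace tokenization
    tokens, i = [], 0
    while i < len(syllables):
        pair = tuple(syllables[i:i + 2])
        if pair in MULTI_SYLLABLE_WORDS:   # merge syllables that form one word
            tokens.append("_".join(pair))
            i += 2
        else:
            tokens.append(syllables[i])
            i += 1
    return tokens

print("thời gian".split())        # ['thời', 'gian']  -> two meaningless pieces
print(tokenize_vi("thời gian"))   # ['thời_gian']     -> one LMU
```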
For some other languages, the orthography might have no spaces to delimit "words" or "tokens", e.g. Chinese, Japanese and sometimes Korean. In that case, tokenization is necessary for a computer to identify LMUs. Often there are morphemes/inflections attached to an LMU, so sometimes a morphological analyzer is more useful than a tokenizer in Natural Language Processing.
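As a hedged illustration, here is a toy forward-maximum-matching segmenter over a tiny hand-picked lexicon; real Chinese/Japanese segmenters use large dictionaries and statistical or neural models:

```python
# Toy forward-maximum-matching segmenter for a script written without spaces.
# The lexicon is a tiny illustrative sample, not a real dictionary.
LEXICON = {"北京", "天安门", "我", "爱"}   # "Beijing", "Tiananmen", "I", "love"
MAX_WORD_LEN = max(len(w) for w in LEXICON)

def fmm_segment(text):
    """Greedy longest-match segmentation; unknown characters become 1-char tokens."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in LEXICON:
                tokens.append(candidate)
                i += length
                break
    return tokens

print(fmm_segment("我爱北京天安门"))   # ['我', '爱', '北京', '天安门']
```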
Some languages, like Chinese, don't use whitespace to separate words at all.
Other languages will use punctuation differently - an apostrophe might or might not be a part of a word, for instance.
Case-folding rules vary from language to language.
Stopwords and stemming are different between languages (though I guess I'm straying from tokenizer to analyzer here).
Edit by Bjerva: Additionally, many languages concatenate compound nouns. Whether these should be tokenised into several tokens or not cannot be easily determined using only whitespace.
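As a sketch of why whitespace alone cannot decide this, here is a greedy dictionary-based compound splitter; the mini-lexicon is an assumption for illustration, and real splitters also handle linking elements ("Fugen-s"), multi-part compounds and ambiguity:

```python
# The mini-lexicon is an assumption for illustration only.
LEXICON = {"eisenbahn", "brücke", "haus", "tür"}

def split_compound(word):
    """Greedy longest-prefix split into two known parts; fall back to the whole word."""
    w = word.lower()
    for i in range(len(w) - 1, 1, -1):
        if w[:i] in LEXICON and w[i:] in LEXICON:
            return [w[:i], w[i:]]
    return [w]

print(split_compound("Eisenbahnbrücke"))  # ['eisenbahn', 'brücke']
print(split_compound("Haustür"))          # ['haus', 'tür']
print(split_compound("Brücke"))           # ['brücke'] -> left intact
```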