I have a number of strings (collections of characters) that represent sentences in different languages, say:
Hello, my name is George.
Das brot ist gut.
... etc.
I want to assign each of them a score (from 0 to 1) indicating the likelihood that it is an English sentence. Is there an accepted algorithm (or Python library) with which to do this?
Note: I don't care if the grammar of the English sentence is perfect.
A Bayesian classifier would be a good choice for this task; for example, using the reverend library:
>>> from reverend.thomas import Bayes
>>> g = Bayes() # guesser
>>> g.train('french','La souris est rentrée dans son trou.')
>>> g.train('english','my tailor is rich.')
>>> g.train('french','Je ne sais pas si je viendrai demain.')
>>> g.train('english','I do not plan to update my website soon.')
>>> print(g.guess('Jumping out of cliffs it not a good idea.'))
[('english', 0.99990000000000001), ('french', 9.9999999999988987e-005)]
>>> print(g.guess('Demain il fera très probablement chaud.'))
[('french', 0.99990000000000001), ('english', 9.9999999999988987e-005)]
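If you just want the single 0..1 score for English that the question asks for, you can read it off the list of (label, probability) pairs that guess() returns. A minimal sketch along those lines (the helper name is mine, not part of reverend):
>>> def english_probability(guesser, sentence):
...     # guess() returns a list of (label, probability) pairs;
...     # fall back to 0.0 if 'english' is not among the labels returned
...     return dict(guesser.guess(sentence)).get('english', 0.0)
...
>>> english_probability(g, 'Jumping out of cliffs it not a good idea.')
0.99990000000000001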
I know the answer has been accepted, however... language identification is usually done with character n-gram models, not the bag-of-words model Raymond suggests. This is not the same as using n-gram features in a classifier (indeed, a classifier in the conventional sense isn't usually used, or really necessary). The reason is that just a few characters are often sufficient to identify a language, whereas bag-of-words classifiers (and even more so bag-of-n-grams) need to see the same words or phrases at test time that they saw in training. Character-based models, on the other hand, need little training data and can identify a language from very little input text.
Here's how it works. We treat a string as the sequence of characters it contains (including spaces and punctuation). We build an n-gram language model over these character sequences; n=3 ought to be sufficient, but you'll get more accuracy with n=5 or n=6 (at the expense of needing proper smoothing, which may or may not be easy depending on how you do it). Say we have character n-gram models, with n=3, for two languages, French and English. Under such a model, the probability of a string:
c = c_1, c_2, ..., c_k
where each c_i is a character (including spaces, punctuation, etc.) is:
p(c) = p(c_1) * p(c_2 | c_1) * p(c_3 | c_1, c_2) * ... * p(c_k | c_k-2, c_k-1)
Now, if we have models for French and English, what this translates to is a set of parameters for this distribution for each language. These are really just tables giving the conditional probability of c_i given (c_i-2, c_i-1), for which the maximum likelihood estimator is just:
count(c_i-2, c_i-1, c_i) / count(c_i-2, c_i-1)
Maximum likelihood estimation is basically never used as-is for language modelling because of the problem of zero probabilities, but these counts still form the core of the parameter estimates; they just need smoothing.
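To make that concrete, here is a minimal sketch of training such a character trigram model with simple add-one (Laplace) smoothing; the function names, the padding trick and the crude smoothing are my own choices, not a prescribed implementation (a real one would use Kneser-Ney, Witten-Bell, or similar):
from collections import defaultdict
from math import log

def train_char_trigrams(text):
    # count(c_i-2, c_i-1, c_i) and count(c_i-2, c_i-1) over the training text
    tri = defaultdict(int)
    bi = defaultdict(int)
    padded = '  ' + text              # pad so the first characters have a context
    for i in range(2, len(padded)):
        context = padded[i-2:i]
        tri[(context, padded[i])] += 1
        bi[context] += 1
    return tri, bi

def log_prob(text, model, alphabet_size=256):
    # sum of log p(c_i | c_i-2, c_i-1), estimated with add-one smoothing:
    # (count(c_i-2, c_i-1, c_i) + 1) / (count(c_i-2, c_i-1) + alphabet_size)
    tri, bi = model
    padded = '  ' + text
    total = 0.0
    for i in range(2, len(padded)):
        context = padded[i-2:i]
        total += log((tri[(context, padded[i])] + 1.0) /
                     (bi[context] + alphabet_size))
    return total
Working in log space avoids numerical underflow when multiplying many small probabilities.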
So, to decide which language the string c is in, evaluate its probability under each of the language models you've trained for the languages you're interested in, and assign the string to the model giving it the highest probability (this is equivalent to a Bayesian classifier with a uniform prior over classes, i.e. languages, but where the assumed distribution is an n-gram model rather than Naive Bayes/multinomial).
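With the sketch above, the decision rule is just an argmax over log-probabilities, and since the question asks for a 0..1 score you can normalise them into posterior probabilities under a uniform prior (again, the file names and function names are mine):
from math import exp

def language_scores(text, models):
    # models: e.g. {'english': train_char_trigrams(...), 'french': ...}
    logs = dict((lang, log_prob(text, m)) for lang, m in models.items())
    top = max(logs.values())                       # subtract the max for stability
    unnorm = dict((lang, exp(lp - top)) for lang, lp in logs.items())
    z = sum(unnorm.values())
    return dict((lang, v / z) for lang, v in unnorm.items())

# hypothetical training corpora, one plain-text file per language
models = {'english': train_char_trigrams(open('english.txt').read()),
          'french': train_char_trigrams(open('french.txt').read())}
print(language_scores('Hello, my name is George.', models))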
There are a lot of places to read about language modelling: a very good tutorial is Josh Goodman's epic (although it's a bit out of date now, the ideas remain unchanged and will be more than adequate for your purposes). You can also take a look at the Wikipedia page, where you'll see that the unigram model is equivalent to a multinomial distribution.
And finally, if you're looking for a Python implementation of language models, probably the most widely used is NLTK.
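If you go the NLTK route, recent versions ship a language-modelling module, nltk.lm, which works for character models too; a rough sketch under the assumption that you have NLTK 3.4 or later (check the nltk.lm docs for the exact API in your version):
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

n = 3
# treat each training sentence as a sequence of characters
train_sents = [list('my tailor is rich.'),
               list('I do not plan to update my website soon.')]
train_data, vocab = padded_everygram_pipeline(n, train_sents)

lm = Laplace(n)                 # add-one smoothed character trigram model
lm.fit(train_data, vocab)

# per-character log scores of a new string under this (English) model
test = list('Jumping out of cliffs is not a good idea.')
score = sum(lm.logscore(c, (c2, c1))
            for c2, c1, c in zip(test, test[1:], test[2:]))
print(score)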
You can find a few suggestions for Python libraries here and here.
Another simple approach: if you have a corpus for each language, you can detect the language of a sentence with a simple look-up in a word-frequency table.
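A minimal sketch of that idea, returning the fraction of a sentence's words that appear in an English word list (the file name is hypothetical; use a frequency list built from your corpus):
def load_wordlist(path):
    # one word per line (optionally followed by its frequency)
    with open(path) as f:
        return set(line.split()[0].lower() for line in f if line.strip())

def english_fraction(sentence, english_words):
    # strip punctuation, lowercase, and count how many words are known English words
    words = [w.strip('.,!?;:"\'').lower() for w in sentence.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in english_words)
    return float(hits) / len(words)      # 0..1 score

english_words = load_wordlist('english_wordlist.txt')
print(english_fraction('Hello, my name is George.', english_words))
print(english_fraction('Das brot ist gut.', english_words))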