I'm looking for a simple way to detect whether a short excerpt of text, a few sentences, is English or not. It seems to me that this problem is much easier than trying to detect an arbitrary language. Is there any software out there that can do this? I'm writing in Python and would prefer a Python library, but something else would be fine too. I've tried Google, but then realized its TOS didn't allow automated queries.
I read about a method to detect English using trigrams.

You can go over the text and count the most frequent trigrams in its words. If the most frequent ones match the most frequent trigrams among English words, the text may be written in English.
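A minimal Python sketch of that idea; the COMMON_ENGLISH_TRIGRAMS set here is a small illustrative sample, and in practice you would derive it from a large English corpus:

```python
from collections import Counter

# Illustrative sample only; a real detector would build this set from a corpus.
COMMON_ENGLISH_TRIGRAMS = {
    "the", "and", "ing", "ion", "tio", "ent", "ati", "for", "her", "ter",
    "hat", "tha", "ere", "ate", "his", "con", "res", "ver", "all", "ons",
}

def trigrams(text):
    """Yield all letter trigrams in the text, ignoring non-letters."""
    letters = [c for c in text.lower() if c.isalpha()]
    for i in range(len(letters) - 2):
        yield "".join(letters[i:i + 3])

def english_score(text, top_n=20):
    """Fraction of the text's most frequent trigrams that are common in English."""
    counts = Counter(trigrams(text))
    if not counts:
        return 0.0
    top = [t for t, _ in counts.most_common(top_n)]
    return sum(t in COMMON_ENGLISH_TRIGRAMS for t in top) / len(top)

print(english_score("Let me begin by saying thanks to all of you."))  # relatively high
print(english_score("Unghsyind jfhakjsnfn dkuajud."))                 # relatively low
```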
Take a look at this Ruby project:
https://github.com/feedbackmine/language_detector
Google Translate API v2 allows automated queries, but it requires an API key that you can get for free at the Google APIs console.
To detect whether text is English, you could use the detect_language_v2() function (which uses that API) from my answer to the question "Python - can I detect unicode string language code?".
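For reference, a minimal sketch of what such a function might do, calling the v2 detect endpoint directly; YOUR_API_KEY is a placeholder and the response shape follows the v2 docs:

```python
import json
import urllib.parse
import urllib.request

def detect_language_v2(text, api_key):
    # Calls the Translate API v2 detect endpoint; requires a (free) API key.
    params = urllib.parse.urlencode({"key": api_key, "q": text})
    url = "https://www.googleapis.com/language/translate/v2/detect?" + params
    with urllib.request.urlopen(url) as resp:
        detection = json.load(resp)["data"]["detections"][0][0]
    return detection["language"]

# print(detect_language_v2("Is this English?", "YOUR_API_KEY") == "en")
```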
I recently wrote a solution for this. My solution is not foolproof, and I do not think it would be computationally viable for large amounts of text, but it seems to me to work well for smallish sentences.
Suppose you have two strings of text:

1. LETMEBEGINBYSAYINGTHANKS
2. UNGHSYINDJFHAKJSNFNDKUAJUD
The goal then is to determine that 1. is probably English while 2. is not. Intuitively, the way my mind determines this is by looking for the word boundaries of English words in the sentences (LET, ME, BEGIN, etc.). But this is not straightforward computationally because there are overlapping words (BE, GIN, BEGIN, SAY, SAYING, THANK, THANKS, etc.).
My method does the following:

1. Determine the intersection of { known English words } and { all substrings of the text of all lengths }.
2. Construct a directed graph on the character indices of the text, with an edge from the starting index of each word found in step 1 to the index just past its end. E.g., index (0) would be L, so "LET" could be represented by (0) -> (3), where index (3) is M, so that's "LET ME".
3. Find the largest integer n between 0 and len(text) for which a simple directed path exists from index 0 to index n.
4. Divide that n by the length of the text to get a rough idea of what percent of the text appears to be consecutive English words.

Note that my code assumes no spaces between words, but I imagine you could adapt it to consider spaces fairly easily. Note also that for my code to work you need an English wordlist file. I got one from here, but you can use any such file, and I imagine this technique could be extended to other languages too.
Here is the code (a minimal sketch of the steps above; the wordlist filename is illustrative):
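```python
def load_words(path="words_alpha.txt"):
    # Wordlist file with one English word per line; the filename is illustrative.
    with open(path) as f:
        return {w for w in (line.strip().upper() for line in f) if len(w) > 1}

def english_fraction(text, words):
    """Rough fraction of `text` covered by consecutive English words from index 0."""
    text = text.upper()
    n = len(text)
    if n == 0:
        return 0.0
    # Step 2: an edge i -> j exists when text[i:j] is a known word. All edges
    # point forward, so reachability from index 0 is one left-to-right sweep.
    reachable = [False] * (n + 1)
    reachable[0] = True
    best = 0  # Step 3: largest index reachable from 0 via consecutive words.
    for i in range(n):
        if not reachable[i]:
            continue
        for j in range(i + 2, n + 1):  # consider words of length >= 2
            if text[i:j] in words:
                reachable[j] = True
                best = max(best, j)
    return best / n  # Step 4: fraction of the text covered

words = load_words()
print(english_fraction("LETMEBEGINBYSAYINGTHANKS", words))    # high, near 1.0
print(english_fraction("UNGHSYINDJFHAKJSNFNDKUAJUD", words))  # low; exact value depends on the wordlist
```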
Running it on the initial examples I gave: approximately speaking, I am 96% certain that LETMEBEGINBYSAYINGTHANKS is English, and 8% certain that UNGHSYINDJFHAKJSNFNDKUAJUD is English. Which sounds about right!

To extend this to much larger pieces of text, my suggestion would be to subsample random short substrings and check their "englishness" (a sketch of that follows below). Hope this helps!
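A possible shape for that extension, reusing english_fraction from the sketch above; the sample count and window size are arbitrary:

```python
import random

def englishness(text, words, samples=20, window=40):
    # Score random fixed-size windows of a long text and average the results.
    # Windows may start mid-word, so this is only a rough estimate.
    text = "".join(c for c in text.upper() if c.isalpha())
    if len(text) <= window:
        return english_fraction(text, words)
    scores = [
        english_fraction(text[i:i + window], words)
        for i in (random.randrange(len(text) - window) for _ in range(samples))
    ]
    return sum(scores) / len(scores)
```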
Although not as good as Google's own, I have had good results using the Apache Nutch LanguageIdentifier, which comes with its own pretrained n-gram models. I had quite good results on a large (50 GB, mostly-text PDF) corpus of real-world data in several languages.

It is in Java, but I'm sure you can read the n-gram profiles from it if you want to reimplement it in Python.
EDIT: This won't work in this case, since the OP is processing text in bulk, which is against Google's TOS.
Use the Google Translate language detect API; the docs include a Python example.
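A sketch along the same lines, using the old AJAX language-detect endpoint (since deprecated); the Referer value is a placeholder:

```python
# The endpoint and response shape follow the legacy docs and may no longer work.
import json
import urllib.parse
import urllib.request

def detect(text):
    params = urllib.parse.urlencode({"v": "1.0", "q": text})
    url = "https://ajax.googleapis.com/ajax/services/language/detect?" + params
    req = urllib.request.Request(url, headers={"Referer": "http://example.com"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)["responseData"]
    return data["language"], data["isReliable"]

# lang, reliable = detect("Is this English?")
# print(lang == "en")
```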