Are there any good, open source engines out there for detecting what language a text is in, perhaps with a probability metric? One that I can run locally and doesn't query Google or Bing? I'd like to detect language for each page in about 15 million pages of OCR'ed text.
Not all of the documents will be in languages that use the Latin alphabet.
I don't think you need anything very sophisticated. For example, to detect whether a document is in English with a pretty high level of certainty, simply test whether it contains the N most common English words, along the lines of the sketch below.
If it contains all of them, I would say it is almost definitely English.
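A minimal sketch of that idea in Python; the ten-word list, the scoring, and the sample page are illustrative assumptions, not part of the original suggestion:

```python
# Toy "most common words" check for English. The ten-word list is an
# illustrative assumption, not a tuned or definitive choice.
COMMON_ENGLISH = {"the", "be", "to", "of", "and", "a", "in", "that", "have", "it"}

def probably_english(text):
    words = set(text.lower().split())
    # Fraction of the common words that appear; 1.0 means all were found.
    score = len(COMMON_ENGLISH & words) / len(COMMON_ENGLISH)
    return score == 1.0, score

page = "It was the best of times, and we have a right to be in that number."
print(probably_english(page))  # (True, 1.0) for this sample sentence
```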
Check out Franc on GitHub. It's written in JavaScript, so you could use it in a browser and maybe in Node too.
Depending on what you're doing, you might want to check out the Python Natural Language Toolkit (NLTK), which has some support for Bayesian learning algorithms.
In general, letter and word frequencies would probably be the fastest check, but NLTK (or a Bayesian learning algorithm in general) will probably be useful if you need to do anything beyond identifying the language. Bayesian methods will also probably be useful if you discover the first two methods have too high an error rate.
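As a rough illustration of that route, here is a sketch using NLTK's NaiveBayesClassifier over character-bigram features; the tiny training set and the feature function are assumptions for demonstration only, not a production model:

```python
import nltk

def char_bigrams(text):
    # Character-bigram presence features: a crude stand-in for letter frequencies.
    text = text.lower()
    return {"bigram=" + text[i:i + 2]: True for i in range(len(text) - 1)}

# Tiny illustrative training set; a real detector would train on far more text per language.
train = [
    (char_bigrams("the quick brown fox jumps over the lazy dog"), "en"),
    (char_bigrams("it is a truth universally acknowledged"), "en"),
    (char_bigrams("der schnelle braune fuchs springt ueber den faulen hund"), "de"),
    (char_bigrams("es ist eine allgemein anerkannte wahrheit"), "de"),
]

classifier = nltk.NaiveBayesClassifier.train(train)

sample = "the rain in spain stays mainly in the plain"
dist = classifier.prob_classify(char_bigrams(sample))
print(dist.max(), dist.prob(dist.max()))  # predicted language code and its probability
```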
Try CLD2:
Installation
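The original answer's exact command isn't shown here; assuming the pycld2 Python bindings, installation would look something like:

```
pip install pycld2
```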
Run
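A minimal sketch, assuming the pycld2 binding; the sample text is arbitrary:

```python
import pycld2 as cld2

# detect() returns a reliability flag, the number of text bytes analysed,
# and a tuple of the top candidate languages.
is_reliable, bytes_found, details = cld2.detect("Dies ist ein kurzer Beispieltext.")
print(is_reliable, details[0])
```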
Gives
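Output roughly of this shape (values are illustrative, not a captured run):

```
True ('GERMAN', 'de', 97, 1024.0)
```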
Others
You could alternatively try Ruby's WhatLanguage gem; it's nice and simple and I've used it for Twitter data analysis. Check out http://www.youtube.com/watch?v=lNqZ2cqOReo&list=UUJ_3fstMOH-g4yBxtvgAWkw&index=0&feature=plcp for a quick demo.
For future reference, the engine I ended up using is libtextcat, which is under a BSD license but seems not to have been maintained since 2003. Still, it does a good job and integrates easily into my toolchain.