Are there any good open-source engines for detecting what language a text is in, ideally with a probability metric? I need one that I can run locally and that doesn't query Google or Bing. I'd like to detect the language of each page in about 15 million pages of OCR'ed text.
Not all of the documents are in languages that use the Latin alphabet.
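You can surely build your own, given some statistics about letter frequencies, digraph frequencies, etc., for each of your target languages: compute the same statistics for a page and pick the language whose statistics it matches most closely.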
Then release it as open source. And voilà, you have an open-source engine for detecting the language of text!
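For what it's worth, here is a minimal sketch of the frequency-statistics idea in Python. Everything in it is an assumption made for illustration: the function names, the tiny sample sentences standing in for real training corpora, and the choice of cosine similarity over letter and digraph frequencies. The normalized scores are only a rough confidence indicator, not a calibrated probability.

```python
import math
from collections import Counter


def ngram_profile(text, n_max=2):
    """Relative frequencies of character 1..n_max-grams (letters and spaces only)."""
    # Lowercase letters are kept (any script); everything else becomes a space.
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    cleaned = " ".join(cleaned.split())  # collapse runs of whitespace
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(cleaned) - n + 1):
            counts[cleaned[i:i + n]] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def detect(text, profiles):
    """Return (best_language, scores), with scores normalized to sum to 1."""
    target = ngram_profile(text)
    sims = {lang: cosine(target, prof) for lang, prof in profiles.items()}
    total = sum(sims.values())
    if total == 0:
        return None, {}
    scores = {lang: s / total for lang, s in sims.items()}
    return max(scores, key=scores.get), scores


# Toy training data (hypothetical): real profiles need far more text per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then it runs "
          "through the forest with the other animals",
    "de": "der schnelle braune fuchs springt über den faulen hund und läuft "
          "dann mit den anderen tieren durch den wald",
    "ru": "быстрая коричневая лиса прыгает через ленивую собаку и затем "
          "бежит по лесу вместе с другими животными",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    language, scores = detect("this is the text of the page", PROFILES)
    print(language, scores)
```

Because the profiles are built from whatever letters appear in the samples, non-Latin scripts need no special handling. For 15 million OCR'ed pages you would want to build the profiles once from a large corpus per language and probably treat low or near-tied scores as "unknown".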