As we all know, numbers can be written either as numerals or spelled out by name. While there are plenty of examples that convert 123 into "one hundred twenty three", I could not find good examples of how to convert in the other direction.
Some of the caveats:
- cardinal/nominal or ordinal: "one" and "first"
- common spelling mistakes: "forty"/"fourty"
- hundreds/thousands: 2100 -> "twenty one hundred" and also "two thousand and one hundred"
- separators: "eleven hundred fifty two", but also "elevenhundred fiftytwo" or "eleven-hundred fifty-two" and whatnot
- colloquialisms: "thirty-something"
- fractions: 'one third', 'two fifths'
- common names: 'a dozen', 'half'
And there are probably more caveats possible that are not yet listed. Suppose the algorithm needs to be very robust, and even understand spelling mistakes.
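To make the spelling-mistake requirement concrete, here is a rough sketch of the kind of normalization step meant, assuming fuzzy matching with Python's standard `difflib` against a fixed vocabulary. The vocabulary, the `normalize_word` name, and the 0.8 cutoff are illustrative assumptions, not a tested design.

```python
import difflib

# Illustrative vocabulary (an assumption); a real parser would list every
# number word it accepts, per language.
NUMBER_WORDS = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen", "twenty", "thirty",
    "forty", "fifty", "sixty", "seventy", "eighty", "ninety",
    "hundred", "thousand", "million",
]


def normalize_word(word, cutoff=0.8):
    """Return the closest known number word, or None if nothing is close enough."""
    word = word.lower()
    if word in NUMBER_WORDS:
        return word
    # get_close_matches returns candidates sorted by similarity, best first.
    matches = difflib.get_close_matches(word, NUMBER_WORDS, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With this vocabulary, `normalize_word("fourty")` resolves to `"forty"`, while an unrelated word such as `"banana"` gives `None`. A lower cutoff tolerates more typos but also risks mapping ordinary words onto number words.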
What fields/papers/studies/algorithms should I read to learn how to write all this? Where is the information?
PS: My final parser should actually understand 3 different languages: English, Russian and Hebrew. More languages may be added at a later stage. Hebrew also has masculine and feminine number forms: "one man" and "one woman" use a different "one" ("ehad" and "ahat"). Russian has some complexities of its own.
Google does a great job at this. For example:
(the reverse is also possible http://www.google.com/search?q=999999999999+in+english)
Ordinal numbers are not applicable because they can't be joined in meaningful ways with other numbers (...at least in English)
e.g. one hundred and first, eleven second, etc...
However, there is another English/American caveat with the word 'and'
i.e.
one hundred and one (English)
one hundred one (American)
Also, the use of 'a' to mean one in English
a thousand = one thousand
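Both caveats can be handled in a small accumulator-style parser. The following is a minimal sketch, not a complete solution: it assumes English cardinals only, skips "and", treats "a" as one, and splits hyphenated words. The function name `words_to_number` and the tables are illustrative, and run-together words such as "elevenhundred" are not handled.

```python
import re

# "a" is mapped to 1 so that "a thousand" parses like "one thousand".
SMALL = {
    "a": 1, "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50, "sixty": 60,
    "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "million": 1_000_000}


def words_to_number(text):
    """Parse an English cardinal such as 'one hundred and one' into an int."""
    # Keeping only runs of letters also splits "fifty-two" into two tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    total = 0    # completed groups (thousands, millions)
    current = 0  # the group being built, below the next scale word
    for token in tokens:
        if token == "and":
            continue  # "one hundred and one" == "one hundred one"
        if token in SMALL:
            current += SMALL[token]
        elif token == "hundred":
            # Multiply rather than add, so "twenty one hundred" gives 2100.
            current = max(current, 1) * 100
        elif token in SCALES:
            total += max(current, 1) * SCALES[token]
            current = 0
        else:
            raise ValueError(f"unknown number word: {token!r}")
    return total + current
```

Under these assumptions, "one hundred and one" and "one hundred one" both parse to 101, and "twenty one hundred" and "two thousand and one hundred" both parse to 2100.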
...On a side note Google's calculator does an amazing job of this.
one hundred and three thousand times the speed of light
And even...
two thousand and one hundred plus a dozen
...wtf?!? a score plus a dozen in roman numerals
My LPC implementation of some of your requirements (American English only):
One place to start looking is the GNU get_date library, which can parse just about any English textual date into a timestamp. While not exactly what you're looking for, its solution to a similar problem could provide a lot of useful clues.
I have some code I wrote a while ago: text2num. This does some of what you want, except it does not handle ordinal numbers. I haven't actually used this code for anything, so it's largely untested!
You should keep in mind that Europe and America count differently.
European standard:
Here is a small reference on it.
A simple way to see the difference is the following:
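Assuming the difference referred to is the short scale (American usage) versus the long scale (traditional in much of continental Europe), a parser could keep one lookup table per convention. The sketch below is illustrative only: `SHORT_SCALE`, `LONG_SCALE`, and `scale_value` are hypothetical names, not taken from any library.

```python
# Short scale: each new "-illion" name is 1,000 times the previous one.
SHORT_SCALE = {
    "million": 10**6,
    "billion": 10**9,
    "trillion": 10**12,
}

# Long scale: each new "-illion" name is 1,000,000 times the previous one,
# with "milliard" filling the 10**9 slot.
LONG_SCALE = {
    "million": 10**6,
    "milliard": 10**9,
    "billion": 10**12,
    "trillion": 10**18,
}


def scale_value(word, long_scale=False):
    """Look up a scale word under the chosen convention (KeyError if unknown)."""
    table = LONG_SCALE if long_scale else SHORT_SCALE
    return table[word.lower()]
```

The same word then maps to different values depending on the convention: `scale_value("billion")` is 10**9, while `scale_value("billion", long_scale=True)` is 10**12.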
Here is an extremely robust solution in Clojure.
AFAIK it is a unique implementation approach.
Here are some examples