As we all know, numbers can be written either as numerals or spelled out in words. While there are plenty of examples that convert 123 into "one hundred twenty three", I could not find good examples of how to convert it the other way around.
Some of the caveats:
- cardinal/nominal or ordinal: "one" and "first"
- common spelling mistakes: "forty"/"fourty"
- hundreds/thousands: 2100 -> "twenty one hundred" and also "two thousand and one hundred"
- separators: "eleven hundred fifty two", but also "elevenhundred fiftytwo" or "eleven-hundred fifty-two" and whatnot
- colloquialisms: "thirty-something"
- fractions: 'one third', 'two fifths'
- common names: 'a dozen', 'half'
There are probably more caveats that are not listed here. Suppose the algorithm needs to be very robust and should even tolerate spelling mistakes.
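The separator and spelling caveats above can be handled in a normalization pass before any real parsing. Here is a minimal Python sketch, not a definitive implementation: `VOCAB`, `split_glued` and `normalize` are hypothetical names made up for this example, the vocabulary is deliberately incomplete, and the 0.8 similarity cutoff is an arbitrary assumption. It splits on whitespace and hyphens, breaks glued words by longest-prefix matching, and falls back to the standard library's `difflib.get_close_matches` for misspellings.

```python
import difflib
import re

# Hypothetical vocabulary; a real parser would cover many more words.
VOCAB = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen", "twenty", "thirty",
    "forty", "fifty", "sixty", "seventy", "eighty", "ninety",
    "hundred", "thousand", "million",
]
# Longest words first, so "fourteen" is tried before "four".
BY_LENGTH = sorted(VOCAB, key=len, reverse=True)


def split_glued(chunk):
    """Split e.g. 'fiftytwo' into known words, or return None if impossible."""
    if not chunk:
        return []
    for word in BY_LENGTH:
        if chunk.startswith(word):
            rest = split_glued(chunk[len(word):])
            if rest is not None:
                return [word] + rest
    return None


def normalize(text):
    """Turn free-form number text into a list of vocabulary tokens."""
    tokens = []
    for chunk in re.split(r"[\s\-]+", text.lower().strip()):
        if not chunk:
            continue
        if chunk in VOCAB:
            tokens.append(chunk)
            continue
        parts = split_glued(chunk)
        if parts is not None:
            tokens.extend(parts)
            continue
        # Assumed cutoff of 0.8; unknown words are passed through unchanged.
        close = difflib.get_close_matches(chunk, VOCAB, n=1, cutoff=0.8)
        tokens.append(close[0] if close else chunk)
    return tokens
```

With this sketch, `normalize("elevenhundred fiftytwo")` and `normalize("eleven-hundred fifty-two")` both give `['eleven', 'hundred', 'fifty', 'two']`, and `normalize("fourty two")` gives `['forty', 'two']`. Fuzzy matching of this kind can also produce false positives on ordinary words, so the cutoff would need tuning against real input.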
What fields/papers/studies/algorithms should I read to learn how to write all this? Where is the information?
PS: My final parser should actually understand 3 different languages, English, Russian and Hebrew. And maybe at a later stage more languages will be added. Hebrew also has male/female numbers, like "one man" and "one woman" have a different "one" — "ehad" and "ahat". Russian also has some of its own complexities.
Google does a great job at this. For example:
(The reverse is also possible.)
Ordinal numbers are not applicable because they can't be joined in meaningful ways with other numbers in language (at least in English)
e.g. one hundred and first, eleven second, etc...
However, there is another English/American caveat with the word 'and':
one hundred and one (English)
one hundred one (American)
Also, the use of 'a' to mean one in English:
a thousand = one thousand
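Both points ('and' as an optional filler, 'a' as a synonym for one) fit into a simple accumulator-style parser. This is only a rough Python sketch under my own assumptions: the `SMALL` and `SCALES` tables and the `words_to_number` function are hypothetical names, it covers cardinals up to the millions only, and it does no grammar checking, so nonsense input like "one one" is accepted.

```python
import re

# Hypothetical lookup tables for this sketch.
SMALL = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50, "sixty": 60,
    "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "million": 1_000_000}


def words_to_number(text):
    """Parse cardinal number words; 'and' is ignored, 'a'/'an' counts as one."""
    total = 0    # sum of the finished thousand/million groups
    current = 0  # value of the group currently being built
    for token in re.split(r"[\s\-]+", text.lower().strip()):
        if not token or token == "and":
            continue
        if token in ("a", "an"):
            current += 1
        elif token in SMALL:
            current += SMALL[token]
        elif token == "hundred":
            # A bare "hundred" is read as one hundred.
            current = max(current, 1) * 100
        elif token in SCALES:
            total += max(current, 1) * SCALES[token]
            current = 0
        else:
            raise ValueError(f"unknown number word: {token!r}")
    return total + current
```

Under these assumptions, "one hundred and one" and "one hundred one" both parse to 101, "a thousand" parses to 1000, and "twenty one hundred" and "two thousand and one hundred" both parse to 2100.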
On a side note, Google's calculator does an amazing job of this:
one hundred and three thousand times the speed of light
And even...
two thousand and one hundred plus a dozen
a score plus a dozen in roman numerals
My LPC implementation of some of your requirements (American English only):
One place to start looking is the GNU get_date library, which can parse just about any English textual date into a timestamp. While not exactly what you're looking for, its solution to a similar problem could provide a lot of useful clues.
I have some code I wrote a while ago: text2num. This does some of what you want, except it does not handle ordinal numbers. I haven't actually used this code for anything, so it's largely untested!
You should keep in mind that Europe and America count differently.
European standard:
Here is a small reference on it.
A simple way to see the difference is the following:
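Assuming the difference meant here is the long scale versus the short scale for naming large numbers (my reading, the details are not spelled out in this post), a parser can keep one lookup table per convention and let the caller choose. In this Python sketch the table names, the `scale_value` function and its `long_scale` flag are all hypothetical, and only a few scale words are listed.

```python
# Short scale: each new name is 1000 times the previous one.
SHORT_SCALE = {
    "million": 10**6,
    "billion": 10**9,
    "trillion": 10**12,
}

# Long scale: each new "-illion" is a million times the previous one,
# with "milliard" filling the 10**9 slot.
LONG_SCALE = {
    "million": 10**6,
    "milliard": 10**9,
    "billion": 10**12,
    "trillion": 10**18,
}


def scale_value(word, long_scale=False):
    """Return the multiplier for a scale word under the chosen convention."""
    table = LONG_SCALE if long_scale else SHORT_SCALE
    if word not in table:
        raise ValueError(f"unknown scale word: {word!r}")
    return table[word]
```

So `scale_value("billion")` is 10**9 while `scale_value("billion", long_scale=True)` is 10**12. Which convention applies depends on the language and locale of the input, so it probably has to be a setting rather than something the parser guesses.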
Here is an extremely robust solution in Clojure.
AFAIK it is a unique implementation approach.
Here are some examples