In the iOS email client, when an email contains a date, time or location, the text becomes a hyperlink, and you can create an appointment or look at a map simply by tapping the link. It works not only for emails in English but for other languages as well. I love this feature and would like to understand how they do it.
The naive way to do this would be to have many regular expressions and run them all. However, I think this is not going to scale very well and will work for only a specific language or date format, etc. I think that Apple must be using some form of machine learning to extract entities (8:00PM, 8PM, 8:00, 0800, 20:00, 20h, 20h00, 2000 etc.).
Any idea how Apple is able to extract entities so quickly in its email client? What machine learning algorithm would you apply to accomplish such a task?
Apple has a patent on how they did it: "System and method for performing an action on a structure in computer data". Here's a story on this patent: apples-patent-on-nsdatadetector
I once wrote a parser to do this, using pyparsing. It's really very simple, you just need to get all the different ways right, but there aren't that many. It only took a few hours and was pretty fast.
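To illustrate the hand-written parser approach, here is a minimal sketch that covers the time formats listed in the question. It uses Python's standard `re` module rather than pyparsing, and it is not the parser described above: the names `TIME_RE` and `find_times` are made up for this example, and the pattern deliberately ignores dates, locations and other languages.

```python
import re

# Hypothetical pattern for the time formats from the question:
# 8:00PM, 8PM, 8:00, 0800, 20:00, 20h, 20h00, 2000
TIME_RE = re.compile(
    r"""
    \b
    (?:
        (?P<h1>\d{1,2}) (?: : (?P<m1>\d{2}) )? \s* (?P<ampm>[AaPp][Mm])  # 8PM, 8:00PM
      | (?P<h2>\d{1,2}) : (?P<m2>\d{2})                                  # 8:00, 20:00
      | (?P<h3>\d{1,2}) [hH] (?P<m3>\d{2})?                              # 20h, 20h00
      | (?P<h4>[01]\d|2[0-3]) (?P<m4>[0-5]\d)                            # 0800, 2000
    )
    \b
    """,
    re.VERBOSE,
)


def find_times(text):
    """Return (matched_text, hour, minute) tuples in 24-hour form."""
    results = []
    for match in TIME_RE.finditer(text):
        g = match.groupdict()
        hour = int(g["h1"] or g["h2"] or g["h3"] or g["h4"])
        minute = int(g["m1"] or g["m2"] or g["m3"] or g["m4"] or 0)
        ampm = (g["ampm"] or "").lower()
        if ampm == "pm" and hour < 12:
            hour += 12
        elif ampm == "am" and hour == 12:
            hour = 0
        # Drop matches that are not valid clock times, such as "25:99".
        if hour < 24 and minute < 60:
            results.append((match.group(0), hour, minute))
    return results


if __name__ == "__main__":
    print(find_times("Meet at 8:00PM or 20h00, not 0800"))
```

Note that a bare four-digit number such as 2000 is ambiguous (it could be a year or a quantity), which is exactly where pure pattern matching starts to need surrounding context.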
That's a technology Apple actually developed a very long time ago called Apple Data Detectors. You can read more about it here: http://www.miramontes.com/writing/add-cacm/
Essentially it parses the text and detects patterns that represent specific pieces of data, then applies OS-contextual actions to it. It's neat.
One part of the puzzle could be the NSDataDetector class. It's used to recognize some standard types like phone numbers.
They likely use Information Extraction techniques for this.
Here is a demo of Stanford's SUTime tool:
http://nlp.stanford.edu:8080/sutime/process
You would extract attributes about n-grams (consecutive words) in a document:
...
And then use a classification algorithm, and feed it positive and negative examples:
You might get away with 50 examples of each, but the more the merrier. Then, the algorithm learns based on those examples, and can apply to future examples that it hasn't seen before.
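The feature-extraction and classification steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the four features, the tiny training set, and the choice of a perceptron are all made up for the example (the answer does not name a specific algorithm or feature set), and a real system would use far more examples and richer features.

```python
import re
from collections import defaultdict

TIME_PREPOSITIONS = {"at", "by", "around"}  # assumed cue words, for illustration


def features(ngram):
    """Turn a two-word n-gram into a dict of simple binary attributes."""
    first_word = ngram.split()[0].lower()
    return {
        "has_digit": float(any(ch.isdigit() for ch in ngram)),
        "has_colon": float(":" in ngram),
        "has_ampm": float(bool(re.search(r"\d\s*[ap]m\b", ngram, re.I))),
        "time_preposition": float(first_word in TIME_PREPOSITIONS),
        "bias": 1.0,
    }


def score(weights, feats):
    return sum(weights[name] * value for name, value in feats.items())


def train_perceptron(examples, max_epochs=100):
    """examples: list of (ngram, label) pairs with label +1 (time) or -1 (not)."""
    weights = defaultdict(float)
    for _ in range(max_epochs):
        mistakes = 0
        for ngram, label in examples:
            feats = features(ngram)
            if label * score(weights, feats) <= 0:  # wrong or undecided
                for name, value in feats.items():
                    weights[name] += label * value
                mistakes += 1
        if mistakes == 0:  # every training example is classified correctly
            break
    return weights


def is_time(weights, ngram):
    return score(weights, features(ngram)) > 0


# Toy positive and negative examples (invented for this sketch).
TRAINING = [
    ("at 8:00PM", +1), ("at 20h00", +1), ("around 8PM", +1), ("by 20:00", +1),
    ("at home", -1), ("room 2000", -1), ("note: see", -1), ("by Friday", -1),
]

if __name__ == "__main__":
    w = train_perceptron(TRAINING)
    for candidate in ["at 9:30", "at work", "flight 2000"]:
        print(candidate, is_time(w, candidate))
```

The learned weights play the role of the rules: for instance, a digit on its own is not enough, but a digit following a word like "at" tips the score towards a time.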
It might learn rules such as
Here is a decent video by a Google engineer on the subject
This is called temporal expression identification and parsing. Here are some Google searches to get you started:
https://www.google.com/#hl=en&safe=off&sclient=psy-ab&q=timebank+timeml+timex
https://www.google.com/#hl=en&safe=off&sclient=psy-ab&q=temporal+expression+tagger