I'm looking for advice on parsing input from a user in multiple combinations of City / State / Zip Code / Country.
A common example would be what Google maps does.
Some examples of input would be:
- "City, State, Country"
- "City, Country"
- "City, Zip Code, Country"
- "City, State, Zip Code"
- "Zip Code"
What would be an efficient and correct way to parse this input from a user?
If you are aware of any example implementations please share :)
The first step would be to break up the text into individual tokens using spaces or commas as the delimiting characters. For scalability, you can then hand each token to a thread or server (if using a Map-Reducer like architecture) to figure out what each token is. For instance,
- If we have numbers in the pattern, then it's probably a zip code.
- Is the item in the list of known states?
- Countries are also fairly easy to handle like states, there's a limited number.
- What order are the tokens in compared to the common ways of writing an address? Most input will probably follow the local post office custom for address formats.
Once you have the individual token results, you can glue the parts back together to get a full address. In the cases where there are questions, you can prompt the user what they really meant (like Google maps) and add that information to a learned list.
The easiest method to add that support to an applications, assuming you're not trying to build a map system, is to query Google or Yahoo and ask them to parse the date for you.
I am myself very fascinated with how Google handles that. I do not remember seeing anything similar anywhere else.
I believe, you try to separate an input string in words trying various delimeters - space, comma, semicolon etc. Then you have several combinations. For each combination, you take each words and match it against country, city, town, postal code database. Then you define some metric on how to evaluate the group match result for each combination. Here should also be cross rules, like if the postal code does not match well, but country, city, town match well and in combination refer to a valid address then the metric yields a high mark.
It is sure difficult and not an evening code exercise. It also requires strong computational resources - a shared hosting would probably crack under just 10 requests, but a data center could serve it well.
Not sure if there is an example implementation. Many geographical services are offered on paid basis. Something that sophisticated as GoogleMaps would likely cost a fortune.
Correct me if I'm wrong.
I found a simple PHP implementation
- http://www.eotz.com/2008/07/parsing-location-string-php/
Yahoo seems to have a webservice that offers the functionality (sort of)
- http://developer.yahoo.com/geo/placemaker/
Openstreetmap seems to offer the same search functionality on its homepage
- http://www.openstreetmap.org/
Assuming you're only dealing with those four fields (City Zip State Country), there are finite values for all fields except for City, and even that I guess if you have a big city list is also finite. So just split each field by comma then check against each field list.
Assuming we're talking US addresses-
- Zip is most obvious, so check for
that first.
- State has 50x2 options
(California or CA), check that next
- Country has ~190x2 options, depending
on how encompassing you want to be
(US, United States, USA).
- Whatever is left over is probably your City.
As far as efficiency goes, it might make sense to check a handful of 'standard' formats first, like Dan suggests.