I have a list of US addresses that I need to break into street, city, state, zip code, etc.
example address : "16100 Sand Canyon Avenue, Suite 380
Irvine, CA 92618"
Does anyone know of a library or a free API to do this? The Google/Yahoo geocoders' TOS forbid using them for commercial projects.
It would be awesome to find a Python library that performs this.
Pyparsing has a bunch of functionality for parsing street addresses; check out an example here: http://pyparsing.wikispaces.com/file/view/streetAddressParser.py
Quite a few of these answers are a few years old now.
The most bulletproof library I've seen recently is usaddress (https://github.com/datamade/usaddress):
- Far more accurate than address (https://pypi.python.org/pypi/address/0.1.1), which we'd been using for a year
- Yet to see it fail on an address
- Still being committed to as of this writing
Pro tip: when testing addresses in all these libraries, use 1) no commas in your address, 2) multi-word city names preferably with "St." in the name to see if the library can differentiate between "street" and "Saint" (e.g., St. Louis), and 3) improper casing. This combo will typically make even the better parsers fall down.
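To make that tip concrete, here is a minimal sketch that generates the "hard" variants of an address so you can feed them to whichever parser you're evaluating. The helper name `stress_variants` and the sample address are made up for illustration; the code uses only the standard library and doesn't call any of the parsers mentioned in this thread.

```python
import re


def stress_variants(address: str) -> list[str]:
    """Return harder-to-parse variants of one address string."""
    # 1) Drop every comma (and the whitespace around it).
    no_commas = re.sub(r"\s*,\s*", " ", address)
    # 3) Improper casing: all-lowercase and all-uppercase copies.
    return [no_commas, no_commas.lower(), no_commas.upper()]


# 2) Hypothetical sample with "St." used both for "Saint" and "Street".
sample = "123 St. Charles St., St. Louis, MO 63101"
variants = stress_variants(sample)
for variant in variants:
    print(variant)
```

Run each variant through the parser and compare the result against what it returns for the clean original; any difference tells you where that library starts to fall down.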
Check out this Python Package:
https://github.com/SwoopSearch/pyaddress
It also allows flexibility if you know enough details about the addresses to be parsed.
That pyparsing library looks very interesting and seems to do a nice job with a variety of examples. And I think that's a more readable alternative to raw regular expressions (which aren't really a good solution for this problem).
Be aware that this kind of solution implies that you will, at some point, be standardizing addresses that aren't valid... they'll just appear valid. If knowing whether an address is in fact real (and perhaps deliverable) is important to your application, then you should be using a USPS-certified service that uses Delivery Point Validation (DPV). I am a developer for SmartyStreets, which provides just such a service, along with SDKs that make integration easy (here's a succinct sample).
The responses come back standardized according to USPS Publication 28. The API is free for low-usage users.
I know this is an old post but someone might find it useful:
https://usaddress.readthedocs.io/en/latest/
>>> import usaddress
>>> usaddress.parse('Robie House, 5757 South Woodlawn Avenue, Chicago, IL 60637')
[('Robie', 'BuildingName'),
('House,', 'BuildingName'),
('5757', 'AddressNumber'),
('South', 'StreetNamePreDirectional'),
('Woodlawn', 'StreetName'),
('Avenue,', 'StreetNamePostType'),
('Chicago,', 'PlaceName'),
('IL', 'StateName'),
('60637', 'ZipCode')]
Or:
>>> import usaddress
>>> usaddress.tag('Robie House, 5757 South Woodlawn Avenue, Chicago, IL 60637')
(OrderedDict([
('BuildingName', 'Robie House'),
('AddressNumber', '5757'),
('StreetNamePreDirectional', 'South'),
('StreetName', 'Woodlawn'),
('StreetNamePostType', 'Avenue'),
('PlaceName', 'Chicago'),
('StateName', 'IL'),
('ZipCode', '60637')]),
'Street Address')
>>> usaddress.tag('State & Lake, Chicago')
(OrderedDict([
('StreetName', 'State'),
('IntersectionSeparator', '&'),
('SecondStreetName', 'Lake'),
('PlaceName', 'Chicago')]),
'Intersection')
>>> usaddress.tag('P.O. Box 123, Chicago, IL')
(OrderedDict([
('USPSBoxType', 'P.O. Box'),
('USPSBoxID', '123'),
('PlaceName', 'Chicago'),
('StateName', 'IL')]),
'PO Box')
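If you only have the token-level list from the first example and want one value per label, a rough sketch like the following can merge it. The helper `group_tokens` is hypothetical and is not part of usaddress; it naively joins every token that shares a label and strips trailing commas, so it is not a substitute for whatever `tag` does internally. The input list is copied from the `parse` output above.

```python
def group_tokens(tagged):
    """Merge (token, label) pairs into one string per label, keeping order."""
    grouped = {}
    for token, label in tagged:
        token = token.rstrip(",")  # the token list keeps trailing commas
        if label in grouped:
            grouped[label] += " " + token
        else:
            grouped[label] = token
    return grouped


# Token list copied from the usaddress.parse example above.
parsed = [
    ("Robie", "BuildingName"),
    ("House,", "BuildingName"),
    ("5757", "AddressNumber"),
    ("South", "StreetNamePreDirectional"),
    ("Woodlawn", "StreetName"),
    ("Avenue,", "StreetNamePostType"),
    ("Chicago,", "PlaceName"),
    ("IL", "StateName"),
    ("60637", "ZipCode"),
]

components = group_tokens(parsed)
print(components["PlaceName"], components["StateName"], components["ZipCode"])
```

This keeps the labels in the order they first appear, which is usually what you want when writing the components out to columns.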
Carefully check your dataset to ensure that this problem hasn't already been handled for you.
I spent a fair amount of time first creating a taxonomy of probable street name endings and using regexp conditionals to try to pluck the street number out of the full address strings, and it turned out that the attributes table for my shapefiles had already segmented out these components.
Before you go forward with the process of parsing address strings, which is always a bit of a chore due to the inevitably strange variations (some parcel addresses are for landlocked parcels and have weird addresses, etc), make sure your dataset hasn't already done this for you!!!
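A quick way to do that check is to scan the column names before writing any parsing code. The sketch below assumes the attribute table has been exported to CSV; the column names, the hint list, and the helper `presegmented_columns` are all made up for illustration, so adjust them to whatever your data actually uses.

```python
import csv
import io

# Hypothetical substrings that suggest a column already holds one address part.
COMPONENT_HINTS = ("street", "city", "state", "zip")


def presegmented_columns(header):
    """Return the column names that look like pre-split address components."""
    return [
        name
        for name in header
        if any(hint in name.lower() for hint in COMPONENT_HINTS)
    ]


# Stand-in for an exported attribute table (made-up columns).
sample_csv = io.StringIO(
    "parcel_id,full_address,street_number,street_name,city,state,zip\n"
    '1,"16100 Sand Canyon Avenue, Irvine, CA 92618",'
    "16100,Sand Canyon Avenue,Irvine,CA,92618\n"
)

header = next(csv.reader(sample_csv))
found = presegmented_columns(header)
print(found)
```

If the list comes back non-empty, look at those columns first; you may not need a parser at all.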
There is a powerful open-source library, libpostal, that fits this use case very nicely. Libpostal is a C library for parsing/normalizing street addresses around the world using statistical NLP and open data, and there are bindings for different programming languages. The goal of the project is to understand location-based strings in every language, everywhere.
I have created a simple Docker image with the Python binding, pypostal, that you can spin up and try very easily: pypostal-docker