new column with coordinates using geopy pandas

2019-03-09 12:18发布

问题:

I have a df:

import pandas as pd
import numpy as np
import datetime as DT
import hmac
from geopy.geocoders import Nominatim
from geopy.distance import vincenty

df


     city_name  state_name  county_name
0    WASHINGTON  DC  DIST OF COLUMBIA
1    WASHINGTON  DC  DIST OF COLUMBIA
2    WASHINGTON  DC  DIST OF COLUMBIA
3    WASHINGTON  DC  DIST OF COLUMBIA
4    WASHINGTON  DC  DIST OF COLUMBIA
5    WASHINGTON  DC  DIST OF COLUMBIA
6    WASHINGTON  DC  DIST OF COLUMBIA
7    WASHINGTON  DC  DIST OF COLUMBIA
8    WASHINGTON  DC  DIST OF COLUMBIA
9    WASHINGTON  DC  DIST OF COLUMBIA

I want to get the latitude and longitude coordinates for any one of the columns in the data frame below. The documentation (http://geopy.readthedocs.org/en/latest/#data) is pretty straightforward when working with the documentation for individual locations.

>>> from geopy.geocoders import Nominatim
>>> geolocator = Nominatim()
>>> location = geolocator.geocode("175 5th Avenue NYC")
>>> print(location.address)
Flatiron Building, 175, 5th Avenue, Flatiron, New York, NYC, New York,     ...
>>> print((location.latitude, location.longitude))
(40.7410861, -73.9896297241625)
>>> print(location.raw)
{'place_id': '9167009604', 'type': 'attraction', ...}

However I want to apply the function to each row in the df and make a new column. I've tried the following

df['city_coord'] = geolocator.geocode(lambda row: 'state_name' (row))

but I think I'm missing something in my code because I get the following:

    city_name   state_name  county_name coordinates
0    WASHINGTON  DC  DIST OF COLUMBIA    None
1    WASHINGTON  DC  DIST OF COLUMBIA    None
2    WASHINGTON  DC  DIST OF COLUMBIA    None
3    WASHINGTON  DC  DIST OF COLUMBIA    None
4    WASHINGTON  DC  DIST OF COLUMBIA    None
5    WASHINGTON  DC  DIST OF COLUMBIA    None
6    WASHINGTON  DC  DIST OF COLUMBIA    None
7    WASHINGTON  DC  DIST OF COLUMBIA    None
8    WASHINGTON  DC  DIST OF COLUMBIA    None
9    WASHINGTON  DC  DIST OF COLUMBIA    None

I would like something like this hopefully using the Lambda function:

     city_name  state_name  county_name  city_coord
0    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
1    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
2    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
3    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
4    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
5    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
6    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
7    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
8    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456 
9    WASHINGTON  DC  DIST OF COLUMBIA    38.8949549, -77.0366456
10   GLYNCO      GA  GLYNN               31.2224512, -81.5101023

I appreciate any help. After I get the coordinates I'd like to map them. Any recommended resources for mapping coordinates is greatly appreciated too. thanks

回答1:

You can call apply and pass the function you want to execute on every row like the following:

In [9]:

geolocator = Nominatim()
df['city_coord'] = df['state_name'].apply(geolocator.geocode)
df
Out[9]:
    city_name state_name       county_name  \
0  WASHINGTON         DC  DIST OF COLUMBIA   
1  WASHINGTON         DC  DIST OF COLUMBIA   

                                          city_coord  
0  (District of Columbia, United States of Americ...  
1  (District of Columbia, United States of Americ...  

You can then access the latitude and longitude attributes:

In [16]:

df['city_coord'] = df['city_coord'].apply(lambda x: (x.latitude, x.longitude))
df
Out[16]:
    city_name state_name       county_name                       city_coord
0  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)
1  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)

Or do it in a one liner by calling apply twice:

In [17]:
df['city_coord'] = df['state_name'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))
df

Out[17]:
    city_name state_name       county_name                       city_coord
0  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)
1  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)

Also your attempt geolocator.geocode(lambda row: 'state_name' (row)) did nothing hence why you have a column full of None values

EDIT

@leb makes an interesting point here, if you have many duplicate values then it'll be more performant to geocode for each unique value and then add this:

In [38]:
states = df['state_name'].unique()
d = dict(zip(states, pd.Series(states).apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))))
d

Out[38]:
{'DC': (38.8937154, -76.9877934586326)}

In [40]:    
df['city_coord'] = df['state_name'].map(d)
df

Out[40]:
    city_name state_name       county_name                       city_coord
0  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)
1  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)

So the above gets all the unique values using unique, constructs a dict from them and then calls map to perform the lookup and add the coords, this will be more efficient than trying to geocode row-wise



回答2:

Upvote and accept @EdChum's answer, I just wanted to add to this. His methods works perfect, but from personal experience I'd like to share a few things:

When dealing with geocoding, if you have multiple city/state combination that are repeating, it's much faster to send only 1 to get geocoded and then replicate the rest to other rows below:

This is very helpful for large data can be done through two ways:

  1. Based on your data only since the rows seem exact duplicate, and only if you want, drop the extra ones and execute geocoding to one of them. This can be done using drop_duplicate
  2. If you want to keep all your rows, group_by the city/state combination, apply geocoding to it the first one by calling head(1), then duplicate to the remainder rows.

Reason is each time you call on Nominatim there's a small latency issue even if you were queuing the same city/state in a row. This small latency gets worse when your data gets large causing a huge delay in response and possible time out.

Again, this is all from personanly dealing with it. Just keep in mind for future use if it doesn't benefit you now.