I am trying to convert a list of country-name data to ISO 3166 country codes (alpha-3) using the pycountry library. My basic function is:

    import pycountry as pc

    def guess_country(data, output='alpha3', verbose=False):
        # Check whether data is already an alpha-3 code
        try:
            country = pc.countries.get(alpha3=data)
            return country
        except KeyError:
            pass  # KeyError raised: data doesn't directly match
        # Check whether data is an exact country name
        try:
            country = pc.countries.get(name=data)
            return country
        except KeyError:
            pass  # KeyError raised: data doesn't directly match
        # Check RegExpr of 'data' in an attempt to match

The issue is that the country-name data is rather dirty. A short sample:
GUATMAL, CHINA T, COLOMB, MEXICO, HG KONG
Does anyone know of a package that returns the best 'guess' match for a given cntry_name? I would be happy for some inputs to be rejected as too difficult (e.g. China T -> Taiwan). It would also be nice if the best guess came with a measure of certainty.
You could use difflib (part of the Python standard library) to select close country names:
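A minimal sketch of that call, assuming a small hand-written list of names as a stand-in for the full country list (with pycountry installed, something like `[c.name for c in pycountry.countries]` could supply it instead). The list contents and the sample input are illustrative assumptions, not part of the original data set.

```python
from difflib import get_close_matches

# Hypothetical stand-in for the full list of official country names.
country_names = ["Guatemala", "China", "Colombia", "Mexico", "Hong Kong", "Taiwan"]

# n caps how many candidates come back; cutoff (0.0 to 1.0) is the minimum
# similarity a candidate needs to be included in the result.
matching_countries = get_close_matches("Colomb", country_names, n=3, cutoff=0.6)

print(matching_countries)
```

With this short list only one candidate clears the cutoff, so the result holds a single name.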
matching_countries would contain a list of similar country names. You can set the number of results returned and the sensitivity of the matching with the optional n and cutoff arguments.
NOTE: get_close_matches is case sensitive, so you may want to convert everything to lower case before looking for matches.
I ran the sample data you had through get_close_matches, and it worked for all but Taiwan.
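For the measure of certainty asked about in the question, one possible approach is to report the `SequenceMatcher` ratio of the best match. The sketch below lower-cases both sides before matching; `best_guess` and `COUNTRY_NAMES` are hypothetical names introduced for illustration, and the short list again stands in for the full set of country names.

```python
from difflib import SequenceMatcher, get_close_matches

# Hypothetical stand-in for the full list of official country names.
COUNTRY_NAMES = ["Guatemala", "China", "Colombia", "Mexico", "Hong Kong", "Taiwan"]

def best_guess(raw_name, names=COUNTRY_NAMES, cutoff=0.6):
    """Return (best matching name, similarity score), or (None, 0.0)."""
    # Map lower-cased names back to their original spelling.
    lowered = {name.lower(): name for name in names}
    query = raw_name.strip().lower()
    matches = get_close_matches(query, list(lowered), n=1, cutoff=cutoff)
    if not matches:
        return None, 0.0
    # Same argument order that get_close_matches uses internally
    # (candidate first, query second), so the score matches its ranking.
    score = SequenceMatcher(None, matches[0], query).ratio()
    return lowered[matches[0]], score

for raw in ["GUATMAL", "CHINA T", "COLOMB", "MEXICO", "HG KONG"]:
    print(raw, "->", best_guess(raw))
```

Note that 'CHINA T' still resolves to China here with a fairly high score, so the score alone will not flag that kind of mismatch.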
If you have a number of known tricky matches, it could be worth keeping a dictionary of common difficult inputs such as 'China T' to handle these exceptions manually. Of course, if the input data is relatively consistent, a simple dictionary lookup on its own may be the best option.
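A rough sketch of combining the two ideas, checking a manual override table first and falling back to fuzzy matching only when there is no override. `MANUAL_OVERRIDES`, `resolve_country` and the short name list are all hypothetical and would need to be filled in from your own data.

```python
from difflib import get_close_matches

# Hypothetical stand-in for the full list of official country names.
COUNTRY_NAMES = ["Guatemala", "China", "Colombia", "Mexico", "Hong Kong", "Taiwan"]

# Hand-maintained fixes for inputs that fuzzy matching gets wrong.
# Keys are lower-cased so the lookup is case insensitive.
MANUAL_OVERRIDES = {"china t": "Taiwan"}

def resolve_country(raw_name, names=COUNTRY_NAMES, cutoff=0.6):
    """Return a country name for raw_name, or None if nothing is close."""
    key = raw_name.strip().lower()
    # Known tricky inputs win over any fuzzy result.
    if key in MANUAL_OVERRIDES:
        return MANUAL_OVERRIDES[key]
    lowered = {name.lower(): name for name in names}
    matches = get_close_matches(key, list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

print(resolve_country("China T"))
print(resolve_country("COLOMB"))
```

Returning None for unmatched inputs lets the caller collect the rejects and decide whether they belong in the override table.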