pycountries: Convert Country Names (Possibly Incom

Posted 2020-07-25 01:49

Question:

I am trying to convert a list of country-name data to ISO 3166 country codes (alpha-3) using the pycountry library. My basic function is as follows:

import pycountry as pc

def guess_country(data, output='alpha3', verbose=False):
    # Check whether the data is already an alpha-3 code
    try:
        country = pc.countries.get(alpha_3=data)
        if country is not None:  # newer pycountry releases return None instead of raising
            return country
    except KeyError:
        pass  # older releases raise KeyError when the data doesn't directly match
    # Check whether the data is an exact country name
    try:
        country = pc.countries.get(name=data)
        if country is not None:
            return country
    except KeyError:
        pass  # KeyError raised, data doesn't directly match
    # Check a regular expression of 'data' in an attempt to match

The issue is that the country-name data is rather dirty. A short sample:

GUATMAL, CHINA T, COLOMB, MEXICO, HG KONG

Does anyone know of a package that returns the best 'guess' match for a given cntry_name? I would be happy for some inputs to be rejected as too difficult (e.g. China T -> Taiwan). It would be nice if the best guess came with a measure of certainty.

Answer 1:

You could use difflib (part of the Python standard library) to select close country names:

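import difflib
import pycountry

country_names = [x.name.lower() for x in pycountry.countries]
# Lowercase the input too, since the candidate names are lowercased
matching_countries = difflib.get_close_matches(data.lower(), country_names)
# get_close_matches returns an empty list when nothing clears the cutoff
confidence = difflib.SequenceMatcher(None, matching_countries[0], data.lower()).ratio() if matching_countries else 0.0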

matching_countries will contain a list of similar country names, best match first. You can control the maximum number of results and the sensitivity of the matching with the optional n and cutoff arguments.
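As a rough illustration of n and cutoff, here is a minimal sketch. The short country_names list and the close_matches helper are hypothetical stand-ins (a real run would build the list from pycountry); only difflib.get_close_matches and difflib.SequenceMatcher come from the standard library.

```python
import difflib

# Hypothetical hand-written list standing in for the lowercased pycountry names.
country_names = ["guatemala", "china", "colombia", "mexico", "hong kong", "taiwan"]


def close_matches(raw, n=3, cutoff=0.6):
    """Return up to n (name, ratio) pairs whose similarity is at least cutoff."""
    query = raw.lower()
    # n caps how many candidates come back; cutoff (0.0 to 1.0) is the minimum ratio.
    names = difflib.get_close_matches(query, country_names, n=n, cutoff=cutoff)
    # Recompute the ratio for each match so it can be reported as a confidence.
    return [(name, difflib.SequenceMatcher(None, name, query).ratio()) for name in names]


print(close_matches("GUATMAL"))               # default n=3, cutoff=0.6
print(close_matches("GUATMAL", cutoff=0.95))  # a stricter cutoff rejects the match
print(close_matches("HG KONG", n=1))          # keep only the single best candidate
```

Raising cutoff trades recall for precision: fewer dirty inputs get matched, but the ones that do are less likely to be wrong.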

NOTE: the get_close_matches function is case-sensitive, so you may want to convert everything to lower case before looking for matches.
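A small sketch of the case-sensitivity point, using a hypothetical two-entry list of lowercased names in place of the real country list:

```python
import difflib

names = ["mexico", "colombia"]  # hypothetical lowercased name list

# Uppercase input shares no characters with the lowercase names, so nothing matches.
upper_result = difflib.get_close_matches("MEXICO", names)

# Lowercasing the input first lets the exact name match.
lower_result = difflib.get_close_matches("MEXICO".lower(), names)

print(upper_result)  # []
print(lower_result)  # ['mexico']
```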

I ran the sample data you had through get_close_matches, and it worked for all but Taiwan.

If you have a number of known tricky matches, it could be worth keeping a dictionary of common difficult inputs such as 'China T' to handle these exceptions manually. Of course, if the input data is relatively consistent, a simple dictionary lookup may be the best option.
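One way the dictionary idea could be combined with fuzzy matching is sketched below. MANUAL_OVERRIDES, COUNTRY_NAMES, and best_guess are hypothetical names chosen for illustration, the name list is a hand-written stand-in for the pycountry names, and returning a confidence of 1.0 for an override is an assumption rather than anything difflib prescribes.

```python
import difflib

# Hypothetical override table for inputs that fuzzy matching gets wrong.
MANUAL_OVERRIDES = {"china t": "taiwan"}

# Hypothetical stand-in for the lowercased pycountry names.
COUNTRY_NAMES = ["guatemala", "china", "colombia", "mexico", "hong kong", "taiwan"]


def best_guess(raw, cutoff=0.6):
    """Return (name, confidence), or (None, 0.0) when nothing is close enough."""
    query = raw.strip().lower()
    # Known tricky inputs are resolved by hand before any fuzzy matching.
    if query in MANUAL_OVERRIDES:
        return MANUAL_OVERRIDES[query], 1.0  # assumed certain because it was hand-checked
    matches = difflib.get_close_matches(query, COUNTRY_NAMES, n=1, cutoff=cutoff)
    if not matches:
        return None, 0.0  # rejected as too difficult
    return matches[0], difflib.SequenceMatcher(None, matches[0], query).ratio()


for sample in ["GUATMAL", "CHINA T", "COLOMB", "MEXICO", "HG KONG"]:
    print(sample, "->", best_guess(sample))
```

Checking the override table first means a hand-verified mapping always wins over a plausible but wrong fuzzy match (here, 'China T' would otherwise resolve to 'china').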



Tags: python pandas