fuzzy match between 2 columns (Python)

2019-05-30 03:13发布

问题:

I have a pandas dataframe called "df_combo" which contains columns "worker_id", "url_entrance", "company_name". I am trying to produce an output column that would tell me if the URLs in "url_entrance" column contains any word in "company_name" column. Even a close match like fuzzywuzzy would work.

For example, if the URL is "www.grandhotelseattle.com" and the "company_name" is "Hotel Prestige Seattle", then the fuzz ratio might be somewhere 70-80.

I have tried the following script: >>>fuzz.ratio(df_combo['url_entrance'],df_combo['company_name']) but it returns only 1 number which is the overall fuzz ratio for the whole column. I would like to have fuzz ratio for every row and store those ratios in a new column.

回答1:

Thanks everyone for your inputs. I have solved my problem! The link that "agg3l" provided was helpful. The "TypeError" I saw was because either the "url_entrance" or "company_name" has some floating types in certain rows. I converted both columns to string using the following scripts, re-ran the fuzz.ratio script and got it to work!

df_combo['url_entrance']=df_combo['url_entrance'].astype(str) df_combo['company_name']=df_combo['company_name'].astype(str)