Python Pandas compare two dataframes to assign cou

I have two dataframes that I read in via csv. Dataframe one consists of a phone number and some additional data. The second dataframe contains country codes and country names.

I want to take the phone number from the first dataset and compare it to the country codes of the second. Country codes can between one to four digits long. I go from the longest country code to the shortest. If there is a match, I want to assign the country name to the phone number.

Input longlist:

phonenumber, add_info    
34123425209, info1
92654321762, info2
12018883637, info3
6323450001, info4
496789521134, info5

Input country_list:

country;country_code;order_info
Spain;34;1
Pakistan;92;4
USA;1;2
Philippines;63;3
Germany;49;4
Poland;48;1
Norway;47;2

Output should be:

phonenumber, add_info, country, order_info    
34123425209, info1, Spain, 1
92654321762, info2, Pakistan, 4
12018883637, info3, USA, 2
6323450001, info4, Philippines, 3
496789521134, info5, Germany, 4

I have it solved once like this:

#! /usr/bin/python
import csv
import pandas
with open ('longlist.csv','r') as lookuplist:
with open ('country_list.csv','r') as inputlist:
    with open('Outputfile.csv', 'w') as outputlist:
        reader = csv.reader(lookuplist, delimiter=',')
        reader2 = csv.reader(inputlist, delimiter=';')
        writer = csv.writer(outputlist, dialect='excel')

        for i in reader2:
            for xl in reader:
                if xl[0].startswith(i[1]):
                    zeile = [xl[0], xl[1], i[0], i[1], i[2]]
                    writer.writerow(zeile)
            lookuplist.seek(0)

But I would like to solve this problem, using pandas. What I got to work: - Read in the csv files - Remove duplicates from "longlist" - Sort list of countries / country code

This is, what I have working already:

import pandas as pd, numpy as np
longlist = pd.read_csv('path/to/longlist.csv', 
                                 usecols=[2,3], names=['PHONENUMBER','ADD_INFO'])
country_list = pd.read_csv('path/to/country_list.csv', 
                           sep=';', names=['COUNTRY','COUNTRY_CODE','ORDER_INFO'], skiprows=[0])

# remove duplicates and make phone number an index
longlist = longlist.drop_duplicates('PHONENUMBER')
longlist = longlist.set_index('PHONENUMBER')

# Sort country list, from high to low value and make country code an index
country_list=country_list.sort_values(by='COUNTRY_CODE', ascending=0)
country_list=country_list.set_index('COUNTRY_CODE')

(...)

longlist.to_csv('path/to/output.csv')

But any way trying the same with datasets does not work. I cannot apply startswith (cannot iterate through objects and cannot apply it on objects). I would really appreciate your help.

标签： python pandas dataframe startswith

1条回答

萌系小妹纸

2楼-- · 2019-07-12 02:23

i would do it this way:

cl = pd.read_csv('country_list.csv', sep=';', dtype={'country_code':str})
ll = pd.read_csv('phones.csv', skipinitialspace=True, dtype={'phonenumber':str})

lookup = cl['country_code']
lookup.index = cl['country_code']

ll['country_code'] = (
    ll['phonenumber']
    .apply(lambda x: pd.Series([lookup.get(x[:4]), lookup.get(x[:3]),
                                lookup.get(x[:2]), lookup.get(x[:1])]))
    .apply(lambda x: x.get(x.first_valid_index()), axis=1)
)

# remove `how='left'` parameter if you don't need "unmatched" phone-numbers    
result = ll.merge(cl, on='country_code', how='left')

Output:

In [195]: result
Out[195]:
    phonenumber add_info country_code      country  order_info
0   34123425209    info1           34        Spain         1.0
1   92654321762    info2           92     Pakistan         4.0
2   12018883637    info3            1          USA         2.0
3   12428883637   info31         1242      Bahamas         3.0
4    6323450001    info4           63  Philippines         3.0
5  496789521134    info5           49      Germany         4.0
6   00000000000      BAD         None          NaN         NaN

Explanation:

In [216]: (ll['phonenumber']
   .....:   .apply(lambda x: pd.Series([lookup.get(x[:4]), lookup.get(x[:3]),
   .....:                               lookup.get(x[:2]), lookup.get(x[:1])]))
   .....: )
Out[216]:
      0     1     2     3
0  None  None    34  None
1  None  None    92  None
2  None  None  None     1
3  1242  None  None     1
4  None  None    63  None
5  None  None    49  None
6  None  None  None  None

phones.csv: - i've intentionally added one Bahamas number (1242...) and one invalid number (00000000000)

phonenumber, add_info
34123425209, info1
92654321762, info2
12018883637, info3
12428883637, info31
6323450001, info4
496789521134, info5
00000000000, BAD

0人赞添加讨论(0) 举报

Python Pandas compare two dataframes to assign cou

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间