Merge two dataframe in pandas

2020-02-01 16:29发布

I am merging two csv(data frame) using below code:

import pandas as pd
a = pd.read_csv(file1,dtype={'student_id': str})
df = pd.read_csv(file2)
c=pd.merge(a,df,on='test_id',how='left')
c.to_csv('test1.csv', index=False)

I have the following CSV files

file1:

test_id, student_id
1, 01990
2, 02300
3, 05555

file2:

test_id, result
1, pass
3, fail

after merge

test_id, student_id , result
1, 1990, pass
2, 2300,
3, 5555, fail

If you notice student_id has 0 appended at the beginning and it's supposed to be considered as text but after merging and using to_csv function it converts it into numeric and removes leading 0.

How can I keep the column as "text" even after to_csv?

I think its to_csv function which saves back again as numeric Added dtype={'student_id': str} while reading csv.. but while saving it as to_csv .. it again convert it to numeric

4条回答
Juvenile、少年°
2楼-- · 2020-02-01 16:44

It's not dropping the leading zero on the merge, it's dropping it on the read_csv. You can fix this by specifying that column is a string at import time:

a = pd.read_csv('file1.csv', dtype={'student_id': str}, skipinitialspace=True)

The important part is the dtype parameter. You are telling pandas to import this column as a string. The skipinitialspace parameter is set to True, because the column headers are defined with spaces, so we strip it:

test_id, student_id
        ^ The student_id starts here, at the space

The final code looks like this:

a = pd.read_csv('file1.csv', dtype={'student_id': str}, skipinitialspace=True)
df = pd.read_csv('file2.csv')
results = a.merge(df, how='left', on='test_id')

With the results dataframe looking like this:

    test_id     student_id  result
0   1           01990       pass
1   2           02300       NaN
2   3           05555       fail

Then when you run to_csv your result should be:

test_id,student_id, result
1,01990, pass
2,02300,
3,05555, fail
查看更多
家丑人穷心不美
3楼-- · 2020-02-01 16:55

Solution with join, first need read_csv with parameter dtype for convert student_id to string and remove whitespaces by skipinitialspace:

df1 = pd.read_csv(file1, dtype={'student_id': str}, skipinitialspace=True)
df2 = pd.read_csv(file2, skipinitialspace=True)

df = df1.join(df2.set_index('test_id'), on='test_id')
print (df)
   test_id student_id  result
0        1      01990    pass
1        2      02300     NaN
2        3      05555    fail
查看更多
太酷不给撩
4楼-- · 2020-02-01 17:01
a = pd.read_csv(file1, dtype={'test_id': object})
df = pd.read_csv(file2, dtype={'test_id': object})

==============================================================

In[28]: pd.merge(a, b, on='test_id', how='left')
Out[28]: 
  test_id   student_id  result
0      01         1990    pass
1      02         2300     NaN
2     003         5555    fail
查看更多
Evening l夕情丶
5楼-- · 2020-02-01 17:03

Caveat Please use merge or join. This answer is provided to give perspective on the flexibility pandas gives you and how many different ways there are to answer the same question.

a = pd.read_csv('file1.csv', converters=dict(student_id=str), skipinitialspace=True)
df = pd.read_csv('file2.csv')
results = pd.concat(
    [d.set_index('test_id') for d in [a, df]],
    axis=1, join='outer'
).reset_index()
查看更多
登录 后发表回答