Pandas left merging 'Date' keys with diffe

Hello Stack Overflow community, I am having an issue where Pandas is not understanding my merge conditions. It works with the other 'keys', but breaks as soon as I include the "Date" column as a key. The "Date" columns are string objects in both dataframes (not timestamps).

In other words, I want all 4 'keys' to be identical before "left merging" the columns from df2 to df without losing any data in df. Also, when I open the csv files in Excel, the dates format look exactly the same (ex: 5/10/2015).

But, Pandas reads the date column in "csv_file1", [df], as "5-6-2015" :

In [1]: df['Date']
Out[1]: 
         Date 
0   2015-5-11    
1   2015-5-11    
2   2015-5-10   
3   2015-5-12

Pandas reads the date column in "csv_file2", [df2], as "5/6/2015" :

In [2]: df2['Date']
Out[2]: 
         Date 
0   5/11/2015    
1   5/11/2015    
2   5/12/2015 
3   5/13/2015
4   5/17/2015

The dtypes for both are "obj"; I do not understand why Pandas would read the format of the 'Date' columns differently.

Here is what the dataframes look like before the left-merge:

In [3]: df
Out[3]: 
         Date Hour    Make   Model  Gas Rating  Safety Rating
0   2015-5-11    1   Honda   Accord         9             8
1   2015-5-11    0   Toyota  Camry          9            10
2   2015-5-10   23   Chevy   Sonic          7             6
3   2015-5-12   13   Honda   Civic          8             7

In [4]: df2
Out[4]: 
         Date Hour    Make   Model  Mileage  Rating  Speed Rating
0   5/11/2015    1   Honda   Accord             10            7
1   5/11/2015    0   Toyota   Camry             10            7
2   5/12/2015   23   Honda    Civic              9            6
3   5/13/2015   23   Honda    Civic              9            6
4   5/17/2015    7   Chevy   Impala

This is what happens when I try to left-merge:

In [5]: final = pd.merge(left=df, right=df2, how='left', on=['Date', 'Hour', 'Make', 'Model'])


In [6]: final
Out[6]: 
            Date Hour   Make   Model  Gas Rating  Safety Rating  Mileage Rating \
   0   2015-5-11    1  Honda   Accord         9             8           NaN   
   1   2015-5-11    0  Toyota  Camry          9            10           NaN     
   2   2015-5-10   23  Chevy   Sonic          7             6           NaN   
   3   2015-5-12   13  Honda   Civic          8             7           NaN   


     Speed Rating  
   0          NaN  
   1          NaN  
   2          NaN  
   3          NaN

If I on try merging without the 'Date' key, The data transfers correctly for the most part, but this is an excess of data due to duplicates in both and will not be accurate because I only need data where all four keys ('Date', 'Hour', 'Make', 'Model') match and anything from df before left merging the data.

There will always be many more duplicates of Make/Model & Hour in df2 so I only want to left merge matches to df, no matter how many duplicate instances within df. I also do not wish to lose any data in df so any dates from df that is not found in df2, should remain.

If the 'Date' merge condition worked, this is the output I am trying to achieve:

In [7]: final
Out[7]: 
                Date Hour   Make   Model  Gas Rating  Safety Rating  Mileage Rating \
       0   5/11/2015    1  Honda   Accord         9             8            10   
       1   5/11/2015    0  Toyota  Camry          9            10            10     
       2   5/10/2015   23  Chevy   Sonic          7             6           NaN   
       3   5/12/2015   13  Honda   Civic          8             7             8   


          Speed Rating  
       0            7  
       1            7  
       2          NaN  
       3            7

Does anyone have an idea why this is happening? I have tried even splicing the 'Date' column into 3 columns ('Month', 'Day', 'Year') and changing the dtype to int64, bool, obj and no success there either. So I assume it has something to do with the format.

Thanks ahead of time Stack Overflow community!

标签： python-2.7 pandas merge key dataframe

1条回答

啃猪蹄的小仙女

2楼-- · 2019-08-31 08:24

Running the below code before the merge should put the dates into a common format so that the merge works properly.

import time

df['Date']=time.strftime('%Y-%m-%d',time.strptime(df['date'],'%m/%d/%Y'))
df2['Date']=time.strftime('%Y-%m-%d',time.strptime(df2['date'],'%Y-%m-%d'))

It would have been nice to simply change one of the dates, but the python time library adds a leading 0 to the month and date with the %m and %d tags. The %-m and %-d tags would not add the leading 0s, but they don't work across all systems. See here for more information on that oddity.

0人赞添加讨论(0) 举报

Pandas left merging 'Date' keys with diffe

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间