Pandas left merging 'Date' keys with diffe

2019-08-31 07:34发布

问题:

Hello Stack Overflow community, I am having an issue where Pandas is not understanding my merge conditions. It works with the other 'keys', but breaks as soon as I include the "Date" column as a key. The "Date" columns are string objects in both dataframes (not timestamps).

In other words, I want all 4 'keys' to be identical before "left merging" the columns from df2 to df without losing any data in df. Also, when I open the csv files in Excel, the dates format look exactly the same (ex: 5/10/2015).

But, Pandas reads the date column in "csv_file1", [df], as "5-6-2015" :

In [1]: df['Date']
Out[1]: 
         Date 
0   2015-5-11    
1   2015-5-11    
2   2015-5-10   
3   2015-5-12  

Pandas reads the date column in "csv_file2", [df2], as "5/6/2015" :

In [2]: df2['Date']
Out[2]: 
         Date 
0   5/11/2015    
1   5/11/2015    
2   5/12/2015 
3   5/13/2015
4   5/17/2015 

The dtypes for both are "obj"; I do not understand why Pandas would read the format of the 'Date' columns differently.

Here is what the dataframes look like before the left-merge:

In [3]: df
Out[3]: 
         Date Hour    Make   Model  Gas Rating  Safety Rating
0   2015-5-11    1   Honda   Accord         9             8
1   2015-5-11    0   Toyota  Camry          9            10
2   2015-5-10   23   Chevy   Sonic          7             6
3   2015-5-12   13   Honda   Civic          8             7

In [4]: df2
Out[4]: 
         Date Hour    Make   Model  Mileage  Rating  Speed Rating
0   5/11/2015    1   Honda   Accord             10            7
1   5/11/2015    0   Toyota   Camry             10            7
2   5/12/2015   23   Honda    Civic              9            6
3   5/13/2015   23   Honda    Civic              9            6
4   5/17/2015    7   Chevy   Impala                

This is what happens when I try to left-merge:

In [5]: final = pd.merge(left=df, right=df2, how='left', on=['Date', 'Hour', 'Make', 'Model'])


In [6]: final
Out[6]: 
            Date Hour   Make   Model  Gas Rating  Safety Rating  Mileage Rating \
   0   2015-5-11    1  Honda   Accord         9             8           NaN   
   1   2015-5-11    0  Toyota  Camry          9            10           NaN     
   2   2015-5-10   23  Chevy   Sonic          7             6           NaN   
   3   2015-5-12   13  Honda   Civic          8             7           NaN   


     Speed Rating  
   0          NaN  
   1          NaN  
   2          NaN  
   3          NaN    

If I on try merging without the 'Date' key, The data transfers correctly for the most part, but this is an excess of data due to duplicates in both and will not be accurate because I only need data where all four keys ('Date', 'Hour', 'Make', 'Model') match and anything from df before left merging the data.

There will always be many more duplicates of Make/Model & Hour in df2 so I only want to left merge matches to df, no matter how many duplicate instances within df. I also do not wish to lose any data in df so any dates from df that is not found in df2, should remain.

If the 'Date' merge condition worked, this is the output I am trying to achieve:

In [7]: final
Out[7]: 
                Date Hour   Make   Model  Gas Rating  Safety Rating  Mileage Rating \
       0   5/11/2015    1  Honda   Accord         9             8            10   
       1   5/11/2015    0  Toyota  Camry          9            10            10     
       2   5/10/2015   23  Chevy   Sonic          7             6           NaN   
       3   5/12/2015   13  Honda   Civic          8             7             8   


          Speed Rating  
       0            7  
       1            7  
       2          NaN  
       3            7 

Does anyone have an idea why this is happening? I have tried even splicing the 'Date' column into 3 columns ('Month', 'Day', 'Year') and changing the dtype to int64, bool, obj and no success there either. So I assume it has something to do with the format.

Thanks ahead of time Stack Overflow community!

回答1:

Running the below code before the merge should put the dates into a common format so that the merge works properly.

import time

df['Date']=time.strftime('%Y-%m-%d',time.strptime(df['date'],'%m/%d/%Y'))
df2['Date']=time.strftime('%Y-%m-%d',time.strptime(df2['date'],'%Y-%m-%d'))

It would have been nice to simply change one of the dates, but the python time library adds a leading 0 to the month and date with the %m and %d tags. The %-m and %-d tags would not add the leading 0s, but they don't work across all systems. See here for more information on that oddity.