Hello Stack Overflow community, I am having an issue where Pandas is not understanding my merge conditions. It works with the other 'keys', but breaks as soon as I include the "Date" column as a key. The "Date" columns are string objects in both dataframes (not timestamps).
In other words, I want all 4 'keys' to be identical before "left merging" the columns from df2 to df without losing any data in df. Also, when I open the csv files in Excel, the dates format look exactly the same (ex: 5/10/2015).
But, Pandas reads the date column in "csv_file1", [df], as "5-6-2015" :
In [1]: df['Date']
Out[1]:
Date
0 2015-5-11
1 2015-5-11
2 2015-5-10
3 2015-5-12
Pandas reads the date column in "csv_file2", [df2], as "5/6/2015" :
In [2]: df2['Date']
Out[2]:
Date
0 5/11/2015
1 5/11/2015
2 5/12/2015
3 5/13/2015
4 5/17/2015
The dtypes for both are "obj"; I do not understand why Pandas would read the format of the 'Date' columns differently.
Here is what the dataframes look like before the left-merge:
In [3]: df
Out[3]:
Date Hour Make Model Gas Rating Safety Rating
0 2015-5-11 1 Honda Accord 9 8
1 2015-5-11 0 Toyota Camry 9 10
2 2015-5-10 23 Chevy Sonic 7 6
3 2015-5-12 13 Honda Civic 8 7
In [4]: df2
Out[4]:
Date Hour Make Model Mileage Rating Speed Rating
0 5/11/2015 1 Honda Accord 10 7
1 5/11/2015 0 Toyota Camry 10 7
2 5/12/2015 23 Honda Civic 9 6
3 5/13/2015 23 Honda Civic 9 6
4 5/17/2015 7 Chevy Impala
This is what happens when I try to left-merge:
In [5]: final = pd.merge(left=df, right=df2, how='left', on=['Date', 'Hour', 'Make', 'Model'])
In [6]: final
Out[6]:
Date Hour Make Model Gas Rating Safety Rating Mileage Rating \
0 2015-5-11 1 Honda Accord 9 8 NaN
1 2015-5-11 0 Toyota Camry 9 10 NaN
2 2015-5-10 23 Chevy Sonic 7 6 NaN
3 2015-5-12 13 Honda Civic 8 7 NaN
Speed Rating
0 NaN
1 NaN
2 NaN
3 NaN
If I on try merging without the 'Date' key, The data transfers correctly for the most part, but this is an excess of data due to duplicates in both and will not be accurate because I only need data where all four keys ('Date', 'Hour', 'Make', 'Model') match and anything from df before left merging the data.
There will always be many more duplicates of Make/Model & Hour in df2 so I only want to left merge matches to df, no matter how many duplicate instances within df. I also do not wish to lose any data in df so any dates from df that is not found in df2, should remain.
If the 'Date' merge condition worked, this is the output I am trying to achieve:
In [7]: final
Out[7]:
Date Hour Make Model Gas Rating Safety Rating Mileage Rating \
0 5/11/2015 1 Honda Accord 9 8 10
1 5/11/2015 0 Toyota Camry 9 10 10
2 5/10/2015 23 Chevy Sonic 7 6 NaN
3 5/12/2015 13 Honda Civic 8 7 8
Speed Rating
0 7
1 7
2 NaN
3 7
Does anyone have an idea why this is happening? I have tried even splicing the 'Date' column into 3 columns ('Month', 'Day', 'Year') and changing the dtype to int64, bool, obj and no success there either. So I assume it has something to do with the format.
Thanks ahead of time Stack Overflow community!
Running the below code before the merge should put the dates into a common format so that the merge works properly.
It would have been nice to simply change one of the dates, but the python time library adds a leading 0 to the month and date with the
%m
and%d
tags. The%-m
and%-d
tags would not add the leading 0s, but they don't work across all systems. See here for more information on that oddity.