I have two DataFrames in pandas and I'm trying to merge them, but pandas keeps changing the row order. I've tried setting indexes and resetting them, but no matter what I do, I can't get the output to have the rows in the same order as the input. Is there a trick? Note that we start out with the loans in the order 'a,b,c', but after the merge it's 'a,c,b'.
import pandas
loans = [ 'a', 'b', 'c' ]
states = [ 'OR', 'CA', 'OR' ]
x = pandas.DataFrame({ 'loan' : loans, 'state' : states })
y = pandas.DataFrame({ 'state' : [ 'CA', 'OR' ], 'value' : [ 1, 2]})
z = x.merge(y, how='left', on='state')
But now the order is no longer the original 'a,b,c'. Any ideas? I'm using pandas version 11.
The fastest way I've found to merge and restore order - if you are merging "left" - is to include the original order as a column in the left dataframe before merging, then use that to restore the order after merging:
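A minimal sketch of that idea, assuming the `x` and `y` frames from the question. The helper column name `_order` is an arbitrary choice for illustration, and selecting with `.loc` is just one way to pull the rows back out in their original positions:

```python
import numpy as np
import pandas as pd

x = pd.DataFrame({"loan": ["a", "b", "c"], "state": ["OR", "CA", "OR"]})
y = pd.DataFrame({"state": ["CA", "OR"], "value": [1, 2]})

# Record each row's original position in a helper column (name is arbitrary).
x_tagged = x.assign(_order=np.arange(len(x)))

# Merge, index by the recorded position, then select positions 0..n-1 in order.
z = (
    x_tagged.merge(y, how="left", on="state")
    .set_index("_order")
    .loc[np.arange(len(x))]
    .reset_index(drop=True)
)
print(z)
```

Using `assign` leaves the original `x` untouched, so the helper column never leaks into the caller's data.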
This method is faster than sorting. Here it is as a function:
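One possible shape for such a function is sketched below. The name `merge_left_in_order` and the `order_col` parameter are hypothetical, chosen for illustration, and the sketch assumes `order_col` does not clash with an existing column in either frame:

```python
import numpy as np
import pandas as pd


def merge_left_in_order(left, right, on=None, order_col="_order"):
    """Left-merge `right` onto `left`, keeping the row order of `left`."""
    # Work on a copy so the caller's frame does not gain the helper column.
    tagged = left.copy()
    tagged[order_col] = np.arange(len(tagged))
    merged = tagged.merge(right, how="left", on=on)
    # Index by original position, then pull positions 0..n-1 out in order.
    ordered = merged.set_index(order_col).loc[np.arange(len(tagged))]
    return ordered.reset_index(drop=True)


x = pd.DataFrame({"loan": ["a", "b", "c"], "state": ["OR", "CA", "OR"]})
y = pd.DataFrame({"state": ["CA", "OR"], "value": [1, 2]})
z = merge_left_in_order(x, y, on="state")
print(z)
```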
Use `pd.merge_ordered()` (see its documentation). For your example:
z = pd.merge_ordered(x, y, how='left', on='state')
EDIT: Just wanted to point out that the default behavior of this function is an outer merge, unlike the more common `.merge()`, which defaults to an inner merge.
I might have a much simpler solution:
Hope it helps
Hopefully someone will provide a better answer, but in case no one does, this will definitely work, so…
Zeroth, I'm assuming you don't want to just end up sorted on `loan`, but to preserve whatever original order was in `x`, which may or may not have anything to do with the order of the `loan` column. (Otherwise, the problem is easier, and less interesting.)
First, you're asking it to sort based on the join keys. As the docs explain, that's the default when you don't pass a `sort` argument.
Second, if you don't sort based on the join keys, the rows will end up grouped together, such that two rows that merged from the same source row end up next to each other, which means you're still going to get `a`, `c`, `b`.
You can work around this by getting the rows grouped together in the order they appear in the original `x` by just merging again with `x` (on either side, it doesn't really matter), or by reindexing based on `x` if you prefer. Like this:
Alternatively, you can cram an x-index in there with `reset_index`, then just sort on that, like this:
Either way obviously seems a bit wasteful, and clumsy… so, as I said, hopefully there's a better answer that I'm just not seeing at the moment. But if not, that works.
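The two workarounds described in this answer can be sketched roughly as follows. This is an illustration written against current pandas, not the answer's original code: `sort_values` and the stable `mergesort` kind are my choices, and the first variant assumes the rows of `x` are unique, since merging back onto `x` would otherwise duplicate rows.

```python
import pandas as pd

x = pd.DataFrame({"loan": ["a", "b", "c"], "state": ["OR", "CA", "OR"]})
y = pd.DataFrame({"state": ["CA", "OR"], "value": [1, 2]})

merged = x.merge(y, how="left", on="state", sort=False)

# Variant 1: merge the result back onto x so the rows follow x's order.
# With no `on`, pandas joins on all shared columns ("loan" and "state").
z1 = x.merge(merged, how="left")

# Variant 2: expose x's index as a column, merge, then sort on that column.
z2 = (
    x.reset_index()  # adds an "index" column holding the original row labels
    .merge(y, how="left", on="state", sort=False)
    .sort_values("index", kind="mergesort")  # stable sort keeps ties in place
    .drop(columns="index")
    .reset_index(drop=True)
)
print(z1)
print(z2)
```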
Pandas v0.8.0 introduced a new merge function that takes order into consideration, `ordered_merge` (renamed `merge_ordered` in later versions), so your solution is now as simple as: