I have a pandas DataFrame whose index I want to sort naturally, but natsort doesn't seem to work. Sorting the index labels before building the DataFrame doesn't help, because the manipulations I do to the DataFrame mess up the ordering in the process. Any thoughts on how I can re-sort the index naturally?
from natsort import natsorted
import pandas as pd
# An unsorted list of strings
a = ['0hr', '128hr', '72hr', '48hr', '96hr']
# Sorted incorrectly
b = sorted(a)
# Naturally Sorted
c = natsorted(a)
# Use a as the index for a DataFrame
df = pd.DataFrame(index=a)
# Sorted Incorrectly
df2 = df.sort_index()
# Natsort doesn't seem to work
df3 = natsorted(df)
print(a)
print(b)
print(c)
print(df.index)
print(df2.index)
print(df3.index)
If you want to sort the df, just sort the index or the data and assign the result directly to the index of the df, rather than passing the df itself to `natsorted`, which yields an empty list.
Note that `df.index = natsorted(df.index)` also works. If you pass the df as an arg, it yields an empty list, in this case because the df is empty (has no columns); otherwise it will return the column names sorted, which is not what you want.
EDIT
If you want to sort the index so that the data is reordered along with the index, then use `reindex` with the naturally sorted index, e.g. `df.reindex(index=natsorted(df.index))`.

Note that you have to assign the result of `reindex` to either a new df or to itself; it does not accept the `inplace` param.

The accepted answer answers the question being asked. I'd like to also add how to use `natsort` on columns in a `DataFrame`, since that will be the next question asked.

As the accepted answer shows, sorting by the index is fairly straightforward.
If you want to sort on a column in the same manner, you need to sort the index by the order that the desired column was reordered.
`natsort` provides the convenience functions `index_natsorted` and `order_by_index` to do just that.

If you want to reorder by an arbitrary number of columns (or a column and the index), you can use `zip` (or `itertools.izip` on Python 2) to specify sorting on multiple columns. The first column given will be the primary sorting column, then secondary, then tertiary, and so on.

Here is an alternate method using `Categorical` objects that I have been told by the `pandas` devs is the "proper" way to do this. This requires (as far as I can see) pandas >= 0.16.0. Currently, it only works on columns, but apparently in pandas >= 0.17.0 they will add `CategoricalIndex`, which will allow this method to be used on an index.

The `Categorical` object lets you define a sorting order for the `DataFrame` to use. The elements given when calling `reorder_categories` must be unique, hence the call to `set` for column "b".

I leave it to the user to decide if this is better than the `reindex` method or not, since it requires you to sort the column data independently before sorting within the `DataFrame` (although I imagine that second sort is rather efficient).

Full disclosure, I am the `natsort` author.