Now that pandas
has support for key
in both sort_values
and sort_index
you should now refer to this other answer and send all upvotes there as it is now the correct answer.
I will leave my answer here for people stuck on old pandas
versions, or as a historical curiosity.
The accepted answer answers the question being asked. I'd like to also add how to use natsort
on columns in a DataFrame
, since that will be the next question asked.
In [1]: from pandas import DataFrame
In [2]: from natsort import natsorted, index_natsorted, order_by_index
In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])
In [4]: df
Out[4]:
a b
0hr a5 b1
128hr a1 b1
72hr a10 b2
48hr a2 b2
96hr a12 b1
As the accepted answer shows, sorting by the index is fairly straightforward:
In [5]: df.reindex(index=natsorted(df.index))
Out[5]:
a b
0hr a5 b1
48hr a2 b2
72hr a10 b2
96hr a12 b1
128hr a1 b1
If you want to sort on a column in the same manner, you need to sort the index by the order that the desired column was reordered. natsort
provides the convenience functions index_natsorted
and order_by_index
to do just that.
In [6]: df.reindex(index=order_by_index(df.index, index_natsorted(df.a)))
Out[6]:
a b
128hr a1 b1
48hr a2 b2
0hr a5 b1
72hr a10 b2
96hr a12 b1
In [7]: df.reindex(index=order_by_index(df.index, index_natsorted(df.b)))
Out[7]:
a b
0hr a5 b1
128hr a1 b1
96hr a12 b1
72hr a10 b2
48hr a2 b2
If you want to reorder by an arbitrary number of columns (or a column and the index), you can use zip
(or itertools.izip
on Python2) to specify sorting on multiple columns. The first column given will be the primary sorting column, then secondary, then tertiary, etc...
In [8]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.a))))
Out[8]:
a b
128hr a1 b1
0hr a5 b1
96hr a12 b1
48hr a2 b2
72hr a10 b2
In [9]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.index))))
Out[9]:
a b
0hr a5 b1
96hr a12 b1
128hr a1 b1
48hr a2 b2
72hr a10 b2
Here is an alternate method using Categorical
objects that I have been told by the pandas
devs is the "proper" way to do this. This requires (as far as I can see) pandas >= 0.16.0. Currently, it only works on columns, but apparently in pandas >= 0.17.0 they will add CategoricalIndex
which will allow this method to be used on an index.
In [1]: from pandas import DataFrame
In [2]: from natsort import natsorted
In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])
In [4]: df.a = df.a.astype('category')
In [5]: df.a.cat.reorder_categories(natsorted(df.a), inplace=True, ordered=True)
In [6]: df.b = df.b.astype('category')
In [8]: df.b.cat.reorder_categories(natsorted(set(df.b)), inplace=True, ordered=True)
In [9]: df.sort('a')
Out[9]:
a b
128hr a1 b1
48hr a2 b2
0hr a5 b1
72hr a10 b2
96hr a12 b1
In [10]: df.sort('b')
Out[10]:
a b
0hr a5 b1
128hr a1 b1
96hr a12 b1
72hr a10 b2
48hr a2 b2
In [11]: df.sort(['b', 'a'])
Out[11]:
a b
128hr a1 b1
0hr a5 b1
96hr a12 b1
48hr a2 b2
72hr a10 b2
The Categorical
object lets you define a sorting order for the DataFrame
to use. The elements given when calling reorder_categories
must be unique, hence the call to set
for column "b".
I leave it to the user to decide if this is better than the reindex
method or not, since it requires you to sort the column data independently before sorting within the DataFrame
(although I imagine that second sort is rather efficient).
Full disclosure, I am the natsort
author.