Using Python 3.x, I have a list of strings for which I would like to perform a natural alphabetical sort.
Natural sort: the order in which Windows sorts files.
For instance, the following list is naturally sorted (what I want):
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
And here's the "sorted" version of the above list (what I have):
['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9']
I'm looking for a sort function which behaves like the first one.
Most likely functools.cmp_to_key() is closely tied to the underlying implementation of Python's sort. Besides, the cmp parameter is legacy. The modern way is to transform the input items into objects that support the desired rich comparison operations.
Under CPython 2.x, objects of disparate types can be ordered even if the respective rich comparison operators haven't been implemented. Under CPython 3.x, objects of different types must explicitly support the comparison. See "How does Python compare string and int?", which links to the official documentation. Most of the answers depend on this implicit ordering. Switching to Python 3.x will require a new type to implement and unify comparisons between numbers and strings.
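As a rough illustration of what such a unifying type could look like, here is a minimal sketch. It is not the code from this answer: the class name Chunk, the helper natural_key, and the rule that numbers sort before text are all assumptions made for the example.

```python
import functools
import re


@functools.total_ordering
class Chunk:
    """Hypothetical wrapper so that int and str pieces can be compared."""

    def __init__(self, value):
        self.value = value

    def _rank(self):
        # Assumption: numeric chunks sort before text chunks.
        # The leading flag ensures an int is never compared with a str.
        if isinstance(self.value, int):
            return (0, self.value, "")
        return (1, 0, self.value)

    def __eq__(self, other):
        if not isinstance(other, Chunk):
            return NotImplemented
        return self._rank() == other._rank()

    def __lt__(self, other):
        if not isinstance(other, Chunk):
            return NotImplemented
        return self._rank() < other._rank()


def natural_key(s):
    # re.split with a capturing group puts digit runs at odd indices.
    parts = re.split(r"(\d+)", s.casefold())
    return [Chunk(int(p) if i % 2 else p) for i, p in enumerate(parts) if p]
```

With this key, sorted(names, key=natural_key) compares number chunks numerically and text chunks case-insensitively, and a string starting with digits sorts before one starting with letters.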
There are three different approaches. The first uses nested classes to take advantage of Python's Iterable comparison algorithm. The second unrolls this nesting into a single class. The third forgoes subclassing str to focus on performance. All are timed; the second is twice as fast, while the third is almost six times faster. Subclassing str isn't required, and was probably a bad idea in the first place, but it does come with certain conveniences.
The sort characters are duplicated to force ordering by case, and case-swapped to force lowercase letters to sort first; this is the typical definition of "natural sort". I couldn't decide on the type of grouping; some might prefer the following, which also brings significant performance benefits:
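To illustrate the case-swapping idea in isolation, here is a minimal sketch; it is not the code from this answer. It assumes one particular grouping, in which case only breaks ties between strings that are otherwise equal, and the function name natural_key_case_tiebreak is made up for the example.

```python
import re


def natural_key_case_tiebreak(s):
    # Digit runs land at odd indices of the split result.
    parts = re.split(r"(\d+)", s)
    # Primary key: numbers as ints, text compared without regard to case.
    primary = [int(p) if i % 2 else p.casefold() for i, p in enumerate(parts)]
    # Tiebreaker: swapping case makes lowercase letters compare as
    # uppercase code points, so "elm1" sorts ahead of "Elm1".
    return primary, s.swapcase()


print(sorted(["Elm1", "elm10", "Elm2", "elm1"], key=natural_key_case_tiebreak))
# ['elm1', 'Elm1', 'Elm2', 'elm10']
```

Applying the swapped-case comparison per character or per chunk instead would let case take priority over the numbers, which is the grouping choice the paragraph above refers to.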
Where utilized, the comparison operators are set to those of object so they won't be ignored by functools.total_ordering.
Natural sorting is both pretty complicated and vaguely defined as a problem. Don't forget to run unicodedata.normalize(...) beforehand, and consider using str.casefold() rather than str.lower(). There are probably subtle encoding issues I haven't considered, so I tentatively recommend the natsort library. I took a quick glance at the GitHub repository; the code maintenance has been stellar.
All the algorithms I've seen depend on tricks such as duplicating and lowering characters, and swapping case. While this doubles the running time, an alternative would require a total natural ordering on the input character set. I don't think this is part of the Unicode specification, and since there are many more Unicode digits than [0-9], creating such a sorting would be equally daunting. If you want locale-aware comparisons, prepare your strings with locale.strxfrm per Python's Sorting HOW TO.
Here's a much more pythonic version of Mark Byers' answer:
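A key function in that spirit might look like the following sketch. The name natural_sort_key and the exact regular expression are assumptions, not necessarily what the answer used. It relies on re.split with a capturing group always placing digit runs at odd indices, so an int is only ever compared with an int and a str with a str.

```python
import re

_DIGIT_RUN = re.compile(r"(\d+)")


def natural_sort_key(s):
    # Even indices hold text (possibly empty), odd indices hold digit runs.
    return [
        int(chunk) if i % 2 else chunk.lower()
        for i, chunk in enumerate(_DIGIT_RUN.split(s))
    ]


names = ['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9']
print(sorted(names, key=natural_sort_key))
# ['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
```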
Now this function can be used as a key in any function that accepts one, like list.sort, sorted, max, etc.
As a lambda:
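One way to write such a lambda, as a sketch under the same assumptions (digit runs at odd indices of the split result, case ignored via str.lower); the variable name natural is made up for the example:

```python
import re

# Same idea as a named key function, written inline.
natural = lambda s: [
    int(t) if i % 2 else t.lower()
    for i, t in enumerate(re.split(r"(\d+)", s))
]

files = ["elm10", "Elm2", "elm1"]
files.sort(key=natural)
print(files)  # ['elm1', 'Elm2', 'elm10']
```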