I want to sort a data.frame by multiple columns. For example, with the data.frame below I would like to sort by column z (descending) then by column b (ascending):
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),
                            levels = c("Low", "Med", "Hi"), ordered = TRUE),
                 x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
                 z = c(1, 1, 1, 2))
dd
    b x y z
1  Hi A 8 1
2 Med D 3 1
3  Hi A 9 1
4 Low C 9 2
The R package data.table provides both fast and memory-efficient ordering of data.tables with a straightforward syntax (a part of which Matt has highlighted quite nicely in his answer). There have been quite a lot of improvements, and also a new function setorder(), since then. From v1.9.5+, setorder() also works with data.frames.
First, we'll create a dataset big enough and benchmark the different methods mentioned in other answers, and then list the features of data.table.
Data:
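A dataset with the same four columns as dd could be generated along these lines. This is only a sketch in base R: the seed and the row count n are assumed values, chosen so the code runs quickly, and are not the size behind the timings quoted in this answer.

```r
# Sketch only: the seed and n are assumptions, not the benchmark's settings.
set.seed(42L)
n <- 1e5L

dat <- data.frame(
  b = factor(sample(c("Hi", "Med", "Low"), n, replace = TRUE),
             levels = c("Low", "Med", "Hi"), ordered = TRUE),
  x = sample(c("A", "D", "C"), n, replace = TRUE),
  y = sample(100L, n, replace = TRUE),
  z = sample(5L, n, replace = TRUE),
  stringsAsFactors = FALSE  # keep x as character on older R versions too
)

# Inspect the column types before sorting.
str(dat)
```

Raising n makes the gap between the methods easier to see, at the cost of memory and run time.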
Benchmarks:
The timings reported are from running system.time(...) on the functions shown below. The timings are tabulated below (in the order of slowest to fastest).
data.table's DT[order(...)] syntax was ~10x faster than the fastest of other methods (dplyr), while consuming the same amount of memory as dplyr.
data.table's setorder() was ~14x faster than the fastest of other methods (dplyr), while taking just 0.4 GB extra memory. dat is now in the order we require (as it is updated by reference).
data.table features:
Speed:
data.table's ordering is extremely fast because it implements radix ordering.
The syntax DT[order(...)] is optimised internally to use data.table's fast ordering as well. You can keep using the familiar base R syntax but speed up the process (and use less memory).
Memory:
Most of the time, we don't require the original data.frame or data.table after reordering. That is, we usually assign the result back to the same object, for example:
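A minimal sketch of that reassignment pattern in base R, reusing the values of dd from the question under the illustrative name dat:

```r
# Small stand-in for a larger table (same values as dd in the question).
dat <- data.frame(
  b = factor(c("Hi", "Med", "Hi", "Low"),
             levels = c("Low", "Med", "Hi"), ordered = TRUE),
  x = c("A", "D", "A", "C"),
  y = c(8, 3, 9, 9),
  z = c(1, 1, 1, 2)
)

# order() returns a row permutation; indexing with it builds a reordered
# copy, which is then assigned back over the original name.
dat <- dat[order(-dat$z, dat$b), ]
dat
```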
The issue is that this requires at least twice (2x) the memory of the original object. To be memory efficient, data.table therefore also provides a function setorder().
setorder() reorders data.tables by reference (in-place), without making any additional copies. It only uses extra memory equal to the size of one column.
Other features:
It supports integer, logical, numeric, character and even bit64::integer64 types.
In base R, we cannot use - on a character vector to sort by that column in decreasing order. Instead we have to use -xtfrm(.).
However, in data.table, we can just do, for example, dat[order(-x)] or setorder(dat, -x).
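The difference can be sketched on a small table with a character column. The data values and the names base_sorted, DT and dt_sorted are illustrative assumptions; the data.table part is wrapped in a check so the snippet still runs where the package is not installed.

```r
# Character column to sort in decreasing order (values are illustrative).
dat <- data.frame(x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
                  stringsAsFactors = FALSE)

# Base R: unary minus is not defined for character vectors,
# so convert to sortable numbers with xtfrm() first.
base_sorted <- dat[order(-xtfrm(dat$x)), ]
base_sorted

# data.table: this part runs only if the package is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  DT <- as.data.table(dat)

  # Returns a reordered copy; DT itself is left untouched.
  dt_sorted <- DT[order(-x)]

  # Reorders DT itself, by reference; no assignment needed.
  setorder(DT, -x)
  print(DT)
}
```

Both routes should give x in the order D, C, A, A, with the two tied "A" rows kept in their original relative order.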
Another alternative, using the rgr package:
You can use the order() function directly without resorting to add-on tools; see this simpler answer which uses a trick right from the top of the example(order) code:
Edit some 2+ years later: It was just asked how to do this by column index. The answer is to simply pass the desired sorting column(s) to the order() function:
rather than using the name of the column (and with() for easier/more direct access).
Just like the mechanical card sorters of long ago, first sort by the least significant key, then the next most significant, etc. No library required, works with any number of keys and any combination of ascending and descending keys.
Now we're ready to do the most significant key. The sort is stable, and any ties in the most significant key have already been resolved.
This may not be the fastest, but it is certainly simple and reliable.
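The two passes can be sketched on dd from the question as follows. The intermediate names pass1, multi_pass, single_call and by_index are illustrative; the last two lines are a cross-check against a single order() call given both keys, once by column name and once by column position.

```r
dd <- data.frame(
  b = factor(c("Hi", "Med", "Hi", "Low"),
             levels = c("Low", "Med", "Hi"), ordered = TRUE),
  x = c("A", "D", "A", "C"),
  y = c(8, 3, 9, 9),
  z = c(1, 1, 1, 2)
)

# Pass 1: least significant key, b ascending.
pass1 <- dd[order(dd$b, decreasing = FALSE), ]

# Pass 2: most significant key, z descending. Rows tied on z keep the
# relative order that pass 1 gave them.
multi_pass <- pass1[order(pass1$z, decreasing = TRUE), ]

# Cross-check: one order() call with both keys, by column name ...
single_call <- dd[with(dd, order(-z, b)), ]

# ... and by column position (z is column 4, b is column 1).
by_index <- dd[order(-dd[, 4], dd[, 1]), ]

multi_pass  # original rows 4, 2, 1, 3
```

The unary minus works here because z is numeric; for a character key, decreasing = TRUE in its own pass avoids the need for -xtfrm().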
For the sake of completeness: you can also use the sortByCol() function from the BBmisc package:
Performance comparison:
Alternatively, using the package Deducer