I want to sort a data.frame by multiple columns. For example, with the data.frame below I would like to sort by column z (descending) then by column b (ascending):
    dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),
                                levels = c("Low", "Med", "Hi"), ordered = TRUE),
                     x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
                     z = c(1, 1, 1, 2))
    dd
        b x y z
    1  Hi A 8 1
    2 Med D 3 1
    3  Hi A 9 1
    4 Low C 9 2
There are a lot of excellent answers here, but dplyr gives the only syntax that I can quickly and easily remember (and so now use very often):
For the OP's problem:
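A minimal sketch of the dplyr call, assuming dplyr is installed and dd is the data.frame from the question (the name `sorted` is only for illustration):

```r
library(dplyr)

# The data.frame from the question
dd <- data.frame(
  b = factor(c("Hi", "Med", "Hi", "Low"),
             levels = c("Low", "Med", "Hi"), ordered = TRUE),
  x = c("A", "D", "A", "C"),
  y = c(8, 3, 9, 9),
  z = c(1, 1, 1, 2)
)

# Sort by z descending, then by b ascending.
# b is an ordered factor, so it sorts by level order (Low < Med < Hi),
# not alphabetically.
sorted <- arrange(dd, desc(z), b)
sorted
```

Columns are listed from most to least significant, and desc() marks the ones to sort in descending order.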
In response to a comment added in the OP for how to sort programmatically:
Using dplyr and data.table
dplyr
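Just use arrange_, which is the Standard Evaluation version of arrange. More info here: https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html
It is better to use a formula, as it also captures the environment in which to evaluate the expression. Note that arrange_ and the other underscored verbs have been deprecated since dplyr 0.7.0 in favor of tidy evaluation.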
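In current dplyr releases, column names held in strings can also be passed to arrange() through the `.data` pronoun. The sketch below assumes a reasonably recent dplyr. `desc_col` and `asc_col` are hypothetical variable names chosen for illustration:

```r
library(dplyr)

# The data.frame from the question
dd <- data.frame(
  b = factor(c("Hi", "Med", "Hi", "Low"),
             levels = c("Low", "Med", "Hi"), ordered = TRUE),
  x = c("A", "D", "A", "C"),
  y = c(8, 3, 9, 9),
  z = c(1, 1, 1, 2)
)

# Column names arrive as strings, e.g. from user input or a config file
desc_col <- "z"  # sort this one descending
asc_col  <- "b"  # then this one ascending

# .data[[name]] looks the column up by its string name inside arrange()
sorted <- arrange(dd, desc(.data[[desc_col]]), .data[[asc_col]])
sorted
```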
data.table
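For data.table, setorderv() takes the column names as a character vector, which suits programmatic sorting. This is a sketch assuming the data.table package is installed. `sort_cols` and `sort_dirs` are hypothetical names for illustration:

```r
library(data.table)

# The data.frame from the question
dd <- data.frame(
  b = factor(c("Hi", "Med", "Hi", "Low"),
             levels = c("Low", "Med", "Hi"), ordered = TRUE),
  x = c("A", "D", "A", "C"),
  y = c(8, 3, 9, 9),
  z = c(1, 1, 1, 2)
)

dt <- as.data.table(dd)

sort_cols <- c("z", "b")   # most significant key first
sort_dirs <- c(-1L, 1L)    # -1 = descending, 1 = ascending

# setorderv() reorders dt by reference, so no assignment is needed
setorderv(dt, sort_cols, order = sort_dirs)
dt
```

With unquoted column names, the interactive equivalent is setorder(dt, -z, b).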
Your choices:
- order from base
- arrange from dplyr
- setorder and setorderv from data.table
- arrange from plyr
- sort from taRifx
- orderBy from doBy
- sortData from Deducer
Most of the time you should use the dplyr or data.table solutions, unless having no dependencies is important, in which case use base::order.

I recently added sort.data.frame to a CRAN package, making it class compatible as discussed here: Best way to create generic/method consistency for sort.data.frame?
Therefore, given the data.frame dd, you can sort as follows:
If you are one of the original authors of this function, please contact me. Discussion as to public domaininess is here: http://chat.stackoverflow.com/transcript/message/1094290#1094290
You can also use the arrange() function from plyr, as Hadley pointed out in the above thread:

Benchmarks: Note that I loaded each package in a new R session since there were a lot of conflicts. In particular, loading the doBy package causes sort to return "The following object(s) are masked from 'x (position 17)': b, x, y, z", and loading the Deducer package overwrites sort.data.frame from Kevin Wright or the taRifx package.

Median times:
dd[with(dd, order(-z, b)), ]: 778
dd[order(-dd$z, dd$b),]: 788
Median time: 1,567
Median time: 862
Median time: 1,694
Note that doBy takes a good bit of time to load the package.
Couldn't make Deducer load. Needs JGR console.
Doesn't appear to be compatible with microbenchmark due to the attach/detach.
(lines extend from lower quartile to upper quartile, dot is the median)
Given these results and weighing simplicity vs. speed, I'd have to give the nod to arrange in the plyr package. It has a simple syntax and yet is almost as speedy as the base R commands with their convoluted machinations. Typically brilliant Hadley Wickham work. My only gripe with it is that it breaks the standard R nomenclature where sorting objects get called by sort(object), but I understand why Hadley did it that way due to issues discussed in the question linked above.

You can use the order() function directly without resorting to add-on tools -- see this simpler answer, which uses a trick right from the top of the example(order) code.

Edit some 2+ years later: It was just asked how to do this by column index. The answer is to simply pass the desired sorting column(s) to the order() function, rather than using the name of the column (and with() for easier/more direct access).

The arrange() in dplyr is my favorite option. Use the pipe operator and go from least important to most important aspect.
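For the base R order() approach described above, here is a minimal sketch of both the by-name and by-index forms, using only base R. The names `by_name` and `by_index` are just for illustration:

```r
# The data.frame from the question
dd <- data.frame(
  b = factor(c("Hi", "Med", "Hi", "Low"),
             levels = c("Low", "Med", "Hi"), ordered = TRUE),
  x = c("A", "D", "A", "C"),
  y = c(8, 3, 9, 9),
  z = c(1, 1, 1, 2)
)

# By column name: with() lets order() see the columns directly.
# Negating the numeric column z gives a descending sort on it.
by_name <- dd[with(dd, order(-z, b)), ]

# By column index: z is column 4, b is column 1
by_index <- dd[order(-dd[, 4], dd[, 1]), ]

by_name
```

Both forms return the rows in the same order. The original row names are kept, which makes it easy to see where each row came from.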
Just like the mechanical card sorters of long ago, first sort by the least significant key, then the next most significant, etc. No library required, works with any number of keys and any combination of ascending and descending keys.
Now we're ready to do the most significant key. The sort is stable, and any ties in the most significant key have already been resolved.
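The two passes described above can be sketched in base R as follows. This relies on order() leaving ties in their original order. `step1` and `step2` are illustrative names:

```r
# The data.frame from the question
dd <- data.frame(
  b = factor(c("Hi", "Med", "Hi", "Low"),
             levels = c("Low", "Med", "Hi"), ordered = TRUE),
  x = c("A", "D", "A", "C"),
  y = c(8, 3, 9, 9),
  z = c(1, 1, 1, 2)
)

# Pass 1: least significant key, b ascending
step1 <- dd[order(dd$b), ]

# Pass 2: most significant key, z descending (negation works for numeric keys).
# Rows tied on z keep the b ordering from pass 1.
step2 <- step1[order(-step1$z), ]

step2
```

The result matches a single call with both keys, dd[order(-dd$z, dd$b), ].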
This may not be the fastest, but it is certainly simple and reliable.