How to sort a dataframe by multiple column(s)?

2018-12-30 23:23发布

I want to sort a data.frame by multiple columns. For example, with the data.frame below I would like to sort by column z (descending) then by column b (ascending):

dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"), 
      levels = c("Low", "Med", "Hi"), ordered = TRUE),
      x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
      z = c(1, 1, 1, 2))
dd
    b x y z
1  Hi A 8 1
2 Med D 3 1
3  Hi A 9 1
4 Low C 9 2

18条回答
君临天下
2楼-- · 2018-12-30 23:57

The R package data.table provides both fast and memory efficient ordering of data.tables with a straightforward syntax (a part of which Matt has highlighted quite nicely in his answer). There has been quite a lot of improvements and also a new function setorder() since then. From v1.9.5+, setorder() also works with data.frames.

First, we'll create a dataset big enough and benchmark the different methods mentioned from other answers and then list the features of data.table.

Data:

require(plyr)
require(doBy)
require(data.table)
require(dplyr)
require(taRifx)

set.seed(45L)
dat = data.frame(b = as.factor(sample(c("Hi", "Med", "Low"), 1e8, TRUE)),
                 x = sample(c("A", "D", "C"), 1e8, TRUE),
                 y = sample(100, 1e8, TRUE),
                 z = sample(5, 1e8, TRUE), 
                 stringsAsFactors = FALSE)

Benchmarks:

The timings reported are from running system.time(...) on these functions shown below. The timings are tabulated below (in the order of slowest to fastest).

orderBy( ~ -z + b, data = dat)     ## doBy
plyr::arrange(dat, desc(z), b)     ## plyr
arrange(dat, desc(z), b)           ## dplyr
sort(dat, f = ~ -z + b)            ## taRifx
dat[with(dat, order(-z, b)), ]     ## base R

# convert to data.table, by reference
setDT(dat)

dat[order(-z, b)]                  ## data.table, base R like syntax
setorder(dat, -z, b)               ## data.table, using setorder()
                                   ## setorder() now also works with data.frames 

# R-session memory usage (BEFORE) = ~2GB (size of 'dat')
# ------------------------------------------------------------
# Package      function    Time (s)  Peak memory   Memory used
# ------------------------------------------------------------
# doBy          orderBy      409.7        6.7 GB        4.7 GB
# taRifx           sort      400.8        6.7 GB        4.7 GB
# plyr          arrange      318.8        5.6 GB        3.6 GB 
# base R          order      299.0        5.6 GB        3.6 GB
# dplyr         arrange       62.7        4.2 GB        2.2 GB
# ------------------------------------------------------------
# data.table      order        6.2        4.2 GB        2.2 GB
# data.table   setorder        4.5        2.4 GB        0.4 GB
# ------------------------------------------------------------
  • data.table's DT[order(...)] syntax was ~10x faster than the fastest of other methods (dplyr), while consuming the same amount of memory as dplyr.

  • data.table's setorder() was ~14x faster than the fastest of other methods (dplyr), while taking just 0.4GB extra memory. dat is now in the order we require (as it is updated by reference).

data.table features:

Speed:

  • data.table's ordering is extremely fast because it implements radix ordering.

  • The syntax DT[order(...)] is optimised internally to use data.table's fast ordering as well. You can keep using the familiar base R syntax but speed up the process (and use less memory).

Memory:

  • Most of the times, we don't require the original data.frame or data.table after reordering. That is, we usually assign the result back to the same object, for example:

    DF <- DF[order(...)]
    

    The issue is that this requires at least twice (2x) the memory of the original object. To be memory efficient, data.table therefore also provides a function setorder().

    setorder() reorders data.tables by reference (in-place), without making any additional copies. It only uses extra memory equal to the size of one column.

Other features:

  1. It supports integer, logical, numeric, character and even bit64::integer64 types.

    Note that factor, Date, POSIXct etc.. classes are all integer/numeric types underneath with additional attributes and are therefore supported as well.

  2. In base R, we can not use - on a character vector to sort by that column in decreasing order. Instead we have to use -xtfrm(.).

    However, in data.table, we can just do, for example, dat[order(-x)] or setorder(dat, -x).

查看更多
无与为乐者.
3楼-- · 2018-12-30 23:59

Another alternative, using the rgr package:

> library(rgr)
> gx.sort.df(dd, ~ -z+b)
    b x y z
4 Low C 9 2
2 Med D 3 1
1  Hi A 8 1
3  Hi A 9 1
查看更多
梦寄多情
4楼-- · 2018-12-31 00:02

You can use the order() function directly without resorting to add-on tools -- see this simpler answer which uses a trick right from the top of the example(order) code:

R> dd[with(dd, order(-z, b)), ]
    b x y z
4 Low C 9 2
2 Med D 3 1
1  Hi A 8 1
3  Hi A 9 1

Edit some 2+ years later: It was just asked how to do this by column index. The answer is to simply pass the desired sorting column(s) to the order() function:

R> dd[order(-dd[,4], dd[,1]), ]
    b x y z
4 Low C 9 2
2 Med D 3 1
1  Hi A 8 1
3  Hi A 9 1
R> 

rather than using the name of the column (and with() for easier/more direct access).

查看更多
栀子花@的思念
5楼-- · 2018-12-31 00:02

Just like the mechanical card sorters of long ago, first sort by the least significant key, then the next most significant, etc. No library required, works with any number of keys and any combination of ascending and descending keys.

 dd <- dd[order(dd$b, decreasing = FALSE),]

Now we're ready to do the most significant key. The sort is stable, and any ties in the most significant key have already been resolved.

dd <- dd[order(dd$z, decreasing = TRUE),]

This may not be the fastest, but it is certainly simple and reliable

查看更多
美炸的是我
6楼-- · 2018-12-31 00:03

For the sake of completeness: you can also use the sortByCol() function from the BBmisc package:

library(BBmisc)
sortByCol(dd, c("z", "b"), asc = c(FALSE, TRUE))
    b x y z
4 Low C 9 2
2 Med D 3 1
1  Hi A 8 1
3  Hi A 9 1

Performance comparison:

library(microbenchmark)
microbenchmark(sortByCol(dd, c("z", "b"), asc = c(FALSE, TRUE)), times = 100000)
median 202.878

library(plyr)
microbenchmark(arrange(dd,desc(z),b),times=100000)
median 148.758

microbenchmark(dd[with(dd, order(-z, b)), ], times = 100000)
median 115.872
查看更多
千与千寻千般痛.
7楼-- · 2018-12-31 00:04

Alternatively, using the package Deducer

library(Deducer)
dd<- sortData(dd,c("z","b"),increasing= c(FALSE,TRUE))
查看更多
登录 后发表回答