Is there an easy way to order a DataFrame by one or more of its columns within Rcpp?
There are many sorting algorithms available on the net, and I could use std::sort with a wrapper for DataFrame, but I was wondering whether something is already available in either Rcpp or RcppArmadillo.
I need to do this sorting / ordering as part of another function:
    DataFrame myFunc(DataFrame myDF, NumericVector x) {
        // ... some code here
        DataFrame myDFsorted = sort(myDF, someColName1, someColName2); // how to sort??
        // ... some code here
    }
I would like to avoid calling R's order function from within Rcpp (to retain the speed of the Rcpp code).
Many thanks
The difficulty is that a data frame is a set of vectors, potentially of different types; we need a way to order them independently of these types (integer, character, ...). In dplyr, we have developed what we call vector visitors. For this particular problem, what we need is a set of `OrderVisitor` objects, which all share one interface.

dplyr then has implementations of `OrderVisitor` for all the types we support, and a dispatcher function, `order_visitor`, that makes an `OrderVisitor*` from a vector.

With this, we can store a set of vector visitors in a `std::vector<OrderVisitor*>`. The `OrderVisitors` class has a constructor taking a `DataFrame` and a `CharacterVector` of the names of the vectors we want to use for the ordering.

Then we can use the `OrderVisitors.apply` method, which essentially does lexicographic ordering. The `apply` method is implemented by simply initializing an `IntegerVector` with `0..n-1` and then calling `std::sort` on it according to the visitors. The relevant part is the `operator()(int,int)` of the `OrderVisitors_Compare` class, which performs that comparison between two rows.

So at this point the result, `index`, gives us the integer indices of the sorted data; we just have to make a new `DataFrame` from the original data frame, `data`, by subsetting it with these indices. For this we have another kind of visitor, encapsulated in the `DataFrameVisitors` class. We first create a `DataFrameVisitors`, which encapsulates a `std::vector<VectorVisitor*>`. Each of these `VectorVisitor*` knows how to subset itself with an integer vector index. This is used from `DataFrameVisitors.subset`.

To wrap this up, a simple ordering function can be written using these tools developed in dplyr.
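The ordering idea described above can be sketched in plain C++ without any R headers. This is not dplyr's code: the class bodies, the helper `make_order_visitor`, and the use of `std::vector` columns in place of Rcpp vectors are all assumptions made for illustration. The sketch also assumes the key columns contain no NaN values, since those would break the ordering that `std::sort` requires.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Type-erased view of one column: it can compare two of its own rows.
class OrderVisitor {
public:
    virtual ~OrderVisitor() {}
    virtual bool equal(int i, int j) const = 0;   // rows i and j tie on this column
    virtual bool before(int i, int j) const = 0;  // row i sorts before row j
    virtual int size() const = 0;
};

// One template covers every element type that has == and <.
template <typename T>
class VectorOrderVisitor : public OrderVisitor {
public:
    explicit VectorOrderVisitor(const std::vector<T>& column) : column_(column) {}
    bool equal(int i, int j) const override { return column_[i] == column_[j]; }
    bool before(int i, int j) const override { return column_[i] < column_[j]; }
    int size() const override { return static_cast<int>(column_.size()); }
private:
    const std::vector<T>& column_;  // not owned; the column must outlive the visitor
};

// Hypothetical dispatcher: builds the right visitor for a column's type.
template <typename T>
std::unique_ptr<OrderVisitor> make_order_visitor(const std::vector<T>& column) {
    return std::unique_ptr<OrderVisitor>(new VectorOrderVisitor<T>(column));
}

// Holds the key columns, in priority order, and computes the ordering index.
class OrderVisitors {
public:
    void add(std::unique_ptr<OrderVisitor> v) {
        assert(visitors_.empty() || v->size() == visitors_.front()->size());
        visitors_.push_back(std::move(v));
    }

    // Lexicographic comparison: the first key column that differs decides.
    bool before(int i, int j) const {
        for (const auto& v : visitors_) {
            if (!v->equal(i, j)) return v->before(i, j);
        }
        return i < j;  // full tie: fall back to the original row order
    }

    // Start from 0..n-1 and sort the row indices with the comparison above.
    std::vector<int> apply() const {
        const int n = visitors_.empty() ? 0 : visitors_.front()->size();
        std::vector<int> index(n);
        std::iota(index.begin(), index.end(), 0);
        std::sort(index.begin(), index.end(),
                  [this](int i, int j) { return before(i, j); });
        return index;
    }

private:
    std::vector<std::unique_ptr<OrderVisitor>> visitors_;
};
```

For a string column `{"b", "a", "b", "a"}` followed by a numeric column `{2.0, 3.0, 1.0, 3.0}`, adding both visitors in that order and calling `apply()` yields the row order 1, 3, 2, 0. Breaking full ties on the row number makes the result deterministic even though `std::sort` itself is not stable.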
Because a `data.frame` is really a list of columns at the C++ level, you would have to re-order all your columns individually given a new ordering index. This is different from how `[.., ..]` indexing works in R for a `data.frame`.

See e.g. this Rcpp Gallery article on sorting vectors for some pointers. You will probably have to supply the new ordering index to be used, after which it is just an indexing question, and that too has some posts on the Gallery.

This SO post may get you started on the index creation; this bytes.com post discusses the same idea.
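Once an ordering index exists, re-ordering column by column might look like the sketch below. It uses plain `std::vector` columns rather than Rcpp types, and the `Frame` struct and both function names are hypothetical, introduced only to show that each column is rearranged separately with the same index.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns a copy of `column` rearranged so that out[k] == column[index[k]].
template <typename T>
std::vector<T> reorder(const std::vector<T>& column, const std::vector<int>& index) {
    assert(index.size() == column.size());
    std::vector<T> out;
    out.reserve(column.size());
    for (int i : index) {
        assert(i >= 0 && static_cast<std::size_t>(i) < column.size());
        out.push_back(column[i]);
    }
    return out;
}

// Hypothetical two-column stand-in for a data frame: a set of equal-length columns.
struct Frame {
    std::vector<std::string> name;
    std::vector<double> score;
};

// Every column is re-ordered individually, all with the same index.
Frame reorder_frame(const Frame& df, const std::vector<int>& index) {
    return Frame{reorder(df.name, index), reorder(df.score, index)};
}
```

With names `{"x", "y", "z"}`, scores `{3.0, 1.0, 2.0}` and the index `{1, 2, 0}`, the result has names `{"y", "z", "x"}` and scores `{1.0, 2.0, 3.0}`.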
Edit: Armadillo also has the functions `sort_index()` and `stable_sort_index()` to create the index you need for re-arranging your columns. This only covers the one-column case and is limited to numeric columns, but it is a start.
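For readers without Armadillo at hand, the single-column case can be approximated with the standard library. The function below is a plain C++ analogue written for illustration, not Armadillo's implementation; the name `stable_order` and its `descending` flag are assumptions. It returns the positions that would sort a numeric column, keeping tied values in their original relative order.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Positions that would sort `x`; ties keep their original relative order.
std::vector<int> stable_order(const std::vector<double>& x, bool descending = false) {
    // NaN values would break the ordering required by the sort, so reject them.
    assert(std::none_of(x.begin(), x.end(), [](double v) { return v != v; }));

    std::vector<int> index(x.size());
    std::iota(index.begin(), index.end(), 0);  // 0, 1, ..., n-1
    std::stable_sort(index.begin(), index.end(),
                     [&x, descending](int i, int j) {
                         return descending ? x[i] > x[j] : x[i] < x[j];
                     });
    return index;
}
```

For `{2.0, 1.0, 2.0, 0.5}` this gives `{3, 1, 0, 2}` ascending and `{0, 2, 1, 3}` descending; in both cases the two rows holding 2.0 stay in their original order.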