order a dataframe by column in Rcpp

2020-07-06 04:29发布

Is there any easy way to order a DataFrame by two (or more or one) of its columns within RCpp?

There are many sorting algorithms available on the net, or I can use std::sort with a wrapper for DataFrame, but I was wondering if there is something already available within either RCpp or RCppArmadillo?

I need to do this sorting / ordering as a part of another function

DataFrame myFunc(DataFrame myDF, NumericVector x) {
  //// some code here
  DataFrame myDFsorted = sort (myDF, someColName1, someColName2) // how to sort??
  //// some code here
}

I would like to avoid accessing R's order function within RCpp (for retaining speed of the RCpp code).

Many thanks

标签: r rcpp
2条回答
家丑人穷心不美
2楼-- · 2020-07-06 05:01

The difficulty is that a data frame is a set of vectors, potentially of different types; We need a way to order them independently of these types (integer, character, ...). In dplyr, we have developed what we call vector visitors. For this particular problem, what we need is a set of OrderVisitor, which exhibit the following interface:

class OrderVisitor {
public:
    virtual ~OrderVisitor(){}

    /** are the elements at indices i and j equal */
    virtual bool equal(int i, int j) const  = 0 ;

    /** is the i element less than the j element */
    virtual bool before( int i, int j) const = 0 ;

    virtual SEXP get() = 0 ;

} ;

dplyr then has implementations of OrderVisitor for all types we are supporting in this file and we have a dispatcher function order_visitor that makes an OrderVisitor* from a vector.

With this, we can store a set of vector visitors into a std::vector<OrderVisitor*>; The OrderVisitors has a constructor taking a DataFrame and a CharacterVector of names of vectors we want to use for the ordering.

OrderVisitors o(data, names ) ;

Then we can use the OrderVisitors.apply method which essentially does lexicographic ordering:

IntegerVector index = o.apply() ;

The apply method is implemented by simply initializing an IntegerVector with 0..n and then std::sort it according to the visitors.

inline Rcpp::IntegerVector OrderVisitors::apply() const {
    IntegerVector x = seq(0, nrows -1 ) ;
    std::sort( x.begin(), x.end(), OrderVisitors_Compare(*this) ) ;
    return x ;
}

The relevant thing here is how the OrderVisitors_Compare class implements operator()(int,int) :

inline bool operator()(int i, int j) const {
    if( i == j ) return false ;
    for( int k=0; k<n; k++)
        if( ! obj.visitors[k]->equal(i,j) )
            return obj.visitors[k]->before(i, j ) ; 
    return i < j ;
}

So at this point index gives us the integer indices of the sorted data, we just have to make a new DataFrame from data by subsetting data with these indices. For this we have another kind of visitors, encapsulated in the DataFrameVisitors class. We first create a DataFrameVisitors :

DataFrameVisitors visitors( data ) ;

This encapsulates a std::vector<VectorVisitor*>. Each of these VectorVisitor* knows how to subset itself with an integer vector index. This is used from DataFrameVisitors.subset:

template <typename Container>
DataFrame subset( const Container& index, const CharacterVector& classes ) const {
    List out(nvisitors);
    for( int k=0; k<nvisitors; k++){
       out[k] = get(k)->subset(index) ;    
    }
    structure( out, Rf_length(out[0]) , classes) ;
    return (SEXP)out ;
}

To wrap this up, here is a simple function using tools developped in dplyr:

#include <dplyr.h>
// [[Rcpp::depends(dplyr)]]

using namespace Rcpp ;
using namespace dplyr ;

// [[Rcpp::export]]
DataFrame myFunc(DataFrame data, CharacterVector names) {
  OrderVisitors o(data, names ) ;
  IntegerVector index = o.apply() ;

  DataFrameVisitors visitors( data ) ;
  DataFrame res = visitors.subset(index, "data.frame" ) ;
  return res ;  
}
查看更多
太酷不给撩
3楼-- · 2020-07-06 05:14

Because a data.frame is really a list of columns at the C++, you would have to re-order all your columns individually given a new ording index. This is different from how [.., ..] indexing works in R for a data.frame.

See e.g. this Rcpp Gallery article on sorting vectors for some pointers. You will probably have to supply the new ordering index to be used, after which it is just an indexing question -- and that too has some posts on the Gallery.

This SO post may get you started on the index creation; this bytes.com post discusses the same idea.

Edit: And Armadillo has function sort_index() and stable_sort_index() to create the index you need for re-arranging your columns. This only covers the one column case, and is limited to numerical columns, but is a start.

查看更多
登录 后发表回答