Finding duplicates using Rcpp

Posted 2019-07-27 14:16

I'm trying to find a speedier replacement for finding duplicates in R. The intent of the code is to pass the matrix to Rcpp along with a row number from that matrix, then loop through the entire matrix looking for a match for that row. The matrix in question is a logical matrix with 1000 rows and 250 columns.

Sounds simple, but the code below is not detecting equivalent rows. I'm not sure if it's an issue with std::equal() or with how the matrix or vectors are defined.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::plugins]]
#include <cstddef>   // std::size_t
#include <iterator>  // std::begin, std::end
#include <vector>    // std::vector
#include <iostream>
#include <string>

// [[Rcpp::export]]
    bool dupCheckRcpp (int nVector, 
                        LogicalMatrix bigMatrix) {
    // initialize
      int i, j, nrow, ncol;
      nrow = bigMatrix.nrow();
      ncol = bigMatrix.ncol();
      LogicalVector vec(ncol);  // holds vector of interest
      LogicalVector vecMatrix(ncol); // temp vector for loop through bigMatrix
      nVector = nVector - 1;

    // copy bigMatrix data into vec based on nVector row
      for ( j = 0; j < ncol; ++j ) {
        vec(j) = bigMatrix(nVector,j);
      }

    // check loop: check vec against each row in bigMatrix
      for (i = 0; i < nrow; ++i) {  
        // copy bigMatrix data into vecMatrix
          for ( j = 0; j < ncol; ++j ) {
            vecMatrix(j) = bigMatrix(i,j);
          }
        // check for equality
          if (i != nVector) {  // skip if nVector row
            // compare vec to vecMatrix
              if (std::equal(vec.begin(),vec.end(),vecMatrix.begin())) {
              return true;
            }
          }
      } // close check loop
      return false;
    }

1 Answer
放荡不羁爱自由
#2 · 2019-07-27 14:40

I'm not exactly sure where the mistake lies in your code, but note that you really shouldn't ever need to manually copy elements between Rcpp types like this:

// copy bigMatrix data into vec based on nVector row
for (j = 0; j < ncol; ++j) {
    vec(j) = bigMatrix(nVector, j);
}

There is almost always a suitable class and/or assignment operator that lets you accomplish this more succinctly and more safely (i.e. with less room for programming error). Here is a simpler implementation:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
bool is_duplicate_row(R_xlen_t r, LogicalMatrix x) {
    // r is a zero-based row index
    R_xlen_t i = 0, nr = x.nrow();
    const LogicalMatrix::Row& y = x.row(r);

    // compare row r against the rows before it
    for (; i < r; i++) {
        if (is_true(all(y == x.row(i)))) {
            return true;
        }
    }
    // compare row r against the rows after it
    for (i = r + 1; i < nr; i++) {
        if (is_true(all(y == x.row(i)))) {
            return true;
        }
    }

    return false;
}

In the spirit of my advice above,

  • const LogicalMatrix::Row& y = x.row(r); gives us a constant reference to the rth row of the matrix
  • x.row(i) refers to the ith row of x

Both of these expressions avoid element-wise copying via for loop, and are more readable IMO. Additionally, while there is certainly nothing wrong with using std::equal or any other standard algorithms, using Rcpp sugar expressions such as is_true(all(y == x.row(i))) can often simplify your code even further.
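
To make the point about avoiding copies concrete outside of Rcpp, here is a minimal plain C++ sketch. It assumes the matrix is stored column by column, as R stores it, so two rows can be compared in place by striding through the buffer. ColMajorMatrix, rows_equal and is_duplicate_row_plain are hypothetical names used only for illustration; they are not part of Rcpp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an R logical matrix: values are stored column by
// column, so element (i, j) lives at data[i + j * nrow].
struct ColMajorMatrix {
    std::size_t nrow;
    std::size_t ncol;
    std::vector<int> data;  // 0 = FALSE, 1 = TRUE
};

// Compare rows a and b in place, without copying either row.
bool rows_equal(const ColMajorMatrix& m, std::size_t a, std::size_t b) {
    assert(a < m.nrow && b < m.nrow);
    for (std::size_t j = 0; j < m.ncol; ++j) {
        if (m.data[a + j * m.nrow] != m.data[b + j * m.nrow]) {
            return false;
        }
    }
    return true;
}

// Returns true if any row other than r (zero-based) is identical to row r.
bool is_duplicate_row_plain(const ColMajorMatrix& m, std::size_t r) {
    assert(r < m.nrow);
    for (std::size_t i = 0; i < m.nrow; ++i) {
        if (i != r && rows_equal(m, r, i)) {
            return true;
        }
    }
    return false;
}
```

For a 3 x 2 matrix whose rows are (1,0), (0,1) and (1,0), is_duplicate_row_plain returns true for rows 0 and 2 and false for row 1.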


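A quick check against base R's duplicated(); 1:nrow(m) - 1 converts R's one-based row numbers to the zero-based index the function expects:

set.seed(123)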
m <- matrix(rbinom(1000 * 250, 1, 0.25) > 0, 1000)
m[600,] <- m[2,]

which(sapply(1:nrow(m) - 1, is_duplicate_row, m))
# [1]   2 600

c(which(duplicated(m, fromLast = TRUE)), which(duplicated(m)))
# [1]   2 600
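
If the goal is to flag every duplicated row, calling a per-row check once for each row compares every pair of rows in the worst case. One alternative, sketched here in plain C++ under the assumptions that the matrix contains no NA values and is stored column-major, is to build a key per row and count the keys in a hash map. duplicated_rows is a hypothetical helper, not an Rcpp or base R function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: flags every row that has at least one identical twin.
// data holds a 0/1 matrix in column-major order (assumed to contain no NA).
std::vector<bool> duplicated_rows(const std::vector<int>& data,
                                  std::size_t nrow, std::size_t ncol) {
    assert(data.size() == nrow * ncol);

    // Build one string key per row and count how often each key occurs.
    std::vector<std::string> keys(nrow, std::string(ncol, '0'));
    std::unordered_map<std::string, std::size_t> counts;
    for (std::size_t i = 0; i < nrow; ++i) {
        for (std::size_t j = 0; j < ncol; ++j) {
            if (data[i + j * nrow] != 0) {
                keys[i][j] = '1';
            }
        }
        ++counts[keys[i]];
    }

    // A row is flagged if its key was seen more than once.
    std::vector<bool> dup(nrow);
    for (std::size_t i = 0; i < nrow; ++i) {
        dup[i] = counts[keys[i]] > 1;
    }
    return dup;
}
```

Like the c(which(duplicated(m, fromLast = TRUE)), which(duplicated(m))) check, this flags both members of a duplicated pair, but it visits each element once instead of comparing rows pairwise.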