R: efficiently computing summaries of value subsets

Posted 2019-05-21 13:28

Question:

I have two tables, A and B. For each row of table A, I want to get some summary statistics for B$value where the value of B$location is within 100 of A$location. I've accomplished this using the for-loop below, but it is a slow solution: it works well when the tables are small, but I would like to scale up to a table A with thousands of rows and a table B with nearly a million rows. Any ideas of how to achieve this? Thanks in advance!

The for-loop:

for (i in 1:nrow(A)) {
   # All rows of B whose location is within 100 of A$location[i]
   temp <- subset(B, abs(A$location[i] - B$location) <= 100)
   A$n[i]   <- nrow(temp)
   A$sum[i] <- sum(temp$value)
   A$avg[i] <- mean(temp$value)
}

An example:
A:
loc
150
250
400

B:
loc  value
 25      7
 77     19
170     10
320     15

Would become:
A:
loc  n  sum   avg
150  2   29  14.5
250  2   25  12.5
400  1   15  15.0

Answer 1:

Similar to Matt Summersgill's answer, you could do a non-equi join to update A:

A[, up := loc + 100]
A[, dn := loc - 100]
A[, c("n", "s", "m") := 
  B[copy(.SD), on=.(loc >= dn, loc <= up), .(.N, sum(value), mean(value)), by=.EACHI][, .(N, V2, V3)]
]

Or in one chained command:

A[, up := loc + 100][, dn := loc - 100][, c("n", "s", "m") := 
  B[copy(.SD), on=.(loc >= dn, loc <= up), 
    .(.N, sum(value), mean(value)), by=.EACHI][, 
    .(N, V2, V3)]
]

This should be fairly efficient: with by=.EACHI the summaries are computed per row of A inside the join, so the full set of matches never needs to be materialized at once.

How it works

Inside j of x[i, j], .SD refers to the subset of data from x (in this case it's all of A).

x[i, on=, j, by=.EACHI] is a join, using each row of i (in this case copy(.SD) == A) to look up matching rows of x (in this case B) using the conditions in on=. For each row of i, j is calculated (which is what by=.EACHI means).

When the columns in j aren't named, names are assigned automatically: V1, V2, and so on. .N gets the name N by default.
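To make the mechanics concrete, here is a minimal sketch on the toy data from the question (my own illustration, assuming A and B are data.tables); the intermediate join result shows the automatically assigned N, V2, V3 before they are written back into A as n, s and m:

library(data.table)

A <- data.table(loc = c(150L, 250L, 400L))
B <- data.table(loc = c(25L, 77L, 170L, 320L),
                value = c(7L, 19L, 10L, 15L))

A[, up := loc + 100]
A[, dn := loc - 100]

## Intermediate result: one row per row of A, with auto-named summary columns N, V2, V3
B[A, on = .(loc >= dn, loc <= up), .(.N, sum(value), mean(value)), by = .EACHI]

## Assign N, V2, V3 back into A as n, s, m (as in the answer above)
A[, c("n", "s", "m") := B[copy(.SD), on = .(loc >= dn, loc <= up),
                          .(.N, sum(value), mean(value)), by = .EACHI][, .(N, V2, V3)]]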



Answer 2:

My pure R solution (below) is still considerably slow: on my system it took 32 seconds to finish Matt Summersgill's big example. Compared to the other solutions, though, it's still reasonable.

The logic of my solution is that, since the inputs are sorted, as you consider new elements of A_loc, the range of values in B_loc will either stay the same if the new A_loc element is identical to the previous, or it will shift to the right in B_loc, possibly contracting or expanding. Note that if you were working with double inputs, you'd have to be a bit more careful with the comparisons.

This C++ version is naturally faster. If you can Rcpp::sourceCpp this code:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
DataFrame foo(IntegerVector A_loc, IntegerVector B_loc, IntegerVector B_val) {
    IntegerVector n(A_loc.length());
    IntegerVector sum(A_loc.length());
    NumericVector avg(A_loc.length());

    // Two-pointer sliding window over the sorted inputs:
    // B rows with indices in [lower, upper) are the ones currently
    // within 100 of the A_loc element being processed.
    int lower = 0;
    int upper = 0;
    int count = 0;
    int current_sum = 0;
    for (int i = 0; i < A_loc.length(); i++) {
        checkUserInterrupt();

        while (lower < B_loc.length()) {
            if (B_loc[lower] >= A_loc[i] - 100) {
                break;
            }

            if (count > 0) {
                count--;
                current_sum -= B_val[lower];
            }

            lower++;
        }

        if (upper < lower) {
            upper = lower;
        }

        while (upper < B_loc.length()) {
            if (B_loc[upper] > A_loc[i] + 100) {
                break;
            }

            count++;
            current_sum += B_val[upper++];
        }

        n[i] = count;
        sum[i] = current_sum;
        avg[i] = static_cast<double>(current_sum) / count;
    }

    DataFrame df = DataFrame::create(
        Named("loc") = A_loc,
        Named("n") = n,
        Named("sum") = sum,
        Named("avg") = avg
    );

    return df;
}

then this:

A <- data.frame(loc = sample.int(1000, size = 1e4, replace = TRUE))


B <- data.frame(loc = sample.int(1000, size = 1e6, replace = TRUE),
                value = sample.int(100, size = 1e6, replace = TRUE))

test <- function() {
    # remove unique if you want to consider duplicated values
    A_loc <- sort(unique(A$loc), decreasing = FALSE)
    B <- B[order(B$loc),]
    out <- foo(A_loc, B$loc, B$value)
}

microbenchmark::microbenchmark(test())

shows these timings:

Unit: milliseconds
   expr      min      lq     mean   median       uq      max neval
 test() 44.74469 45.8118 51.35361 47.34657 48.99376 95.00938   100

If you can't use Rcpp, then consider the R version below, or Frank's solution with data.table; I think sorting the inputs might also help in that case.


for loops are usually avoided in R, but I don't think they are always slow; you just have to be careful not to copy data too much. Also, since R v3.5.0, writing something like for (i in 1:10) no longer allocates the whole vector first, because integer ranges now have a compact (ALTREP) representation.
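As a quick check of that last point (my own aside, assuming R >= 3.5.0), the loop below finishes instantly and allocates essentially nothing, even though the range nominally has a billion elements:

# On R >= 3.5.0 the range 1:1e9 is a compact (ALTREP) sequence,
# so the ~4 GB integer vector is never actually allocated.
for (i in 1:1e9) {
  if (i > 3) break
}

The pure R version of the solution: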

A_loc <- sort(unique(A$loc), decreasing = FALSE)
B <- B[order(B$loc),]

out <- data.frame(loc = A_loc,
                  n = 0L,
                  sum = 0L,
                  avg = 0)

# Sliding window: B rows lower .. (upper - 1) are the ones currently
# within 100 of the A_loc element being processed
lower <- 1L
upper <- 1L
count <- 0L
sum <- 0L
upper_limit <- nrow(B)
for (i in seq_along(A_loc)) {
  current_loc <- A_loc[i]

  while (lower <= upper_limit) {
    if (B$loc[lower] >= current_loc - 100L) {
      break
    }

    if (count > 0L) {
      count <- count - 1L
      sum <- sum - B$value[lower]
    }

    lower <- lower + 1L
  }

  if (upper < lower) {
    upper <- lower
  }

  while (upper <= upper_limit) {
    if (B$loc[upper] > current_loc + 100L) {
      break
    }

    count <- count + 1L
    sum <- sum + B$value[upper]
    upper <- upper + 1L
  }

  out$n[i] <- count
  out$sum[i] <- sum
  out$avg[i] <- sum / count
}
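As a side note (my own addition, not part of the original answer), the same sliding-window result can be sanity-checked with a vectorized base-R sketch using findInterval and cumulative sums over the already-sorted B; the names lo, hi, cs and chk are mine:

# lo[i] = number of B rows strictly below A_loc[i] - 100
# hi[i] = number of B rows at or below A_loc[i] + 100
lo <- findInterval(A_loc - 100L, B$loc, left.open = TRUE)
hi <- findInterval(A_loc + 100L, B$loc)

cs  <- c(0, cumsum(B$value))
chk <- data.frame(loc = A_loc,
                  n   = hi - lo,
                  sum = cs[hi + 1] - cs[lo + 1])
chk$avg <- chk$sum / chk$n

all.equal(out$n, chk$n)       # should be TRUE
all.equal(out$sum, chk$sum)   # should be TRUE (integer vs double is tolerated)
all.equal(out$avg, chk$avg)   # should be TRUE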


Answer 3:

Here's a tidyverse solution

library(tidyverse)

A = read.table(text = "
loc
150
250
400
", header=T)

B = read.table(text = "
loc    value
25     7
77     19
170    10
320    15
", header=T)

A %>%
  mutate(B = list(B)) %>%              # create all combinations of rows of A and B
  unnest() %>%
  filter(abs(loc - loc1) <= 100) %>%   # keep rows that satisfy your condition
  group_by(loc) %>%                    # for each loc values
  summarise(sum = sum(value),          # calculate sum
            avg = mean(value))         # calculate mean

# # A tibble: 3 x 3
#     loc   sum   avg
#    <int> <int> <dbl>
# 1   150    29  14.5
# 2   250    25  12.5
# 3   400    15  15  

Maybe not the best solution if you have large A and B tables as you have to create all combinations of rows and then filter.
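One small addition (mine, not part of the original answer): the question also asks for the count n, which can be included with n() in the same summarise(); this assumes the same tidyr behaviour as above, where unnest() renames the nested loc column to loc1:

A %>%
  mutate(B = list(B)) %>%
  unnest() %>%
  filter(abs(loc - loc1) <= 100) %>%
  group_by(loc) %>%
  summarise(n = n(),                   # number of matching B rows
            sum = sum(value),
            avg = mean(value))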



Answer 4:

This is possible with the foverlaps function in data.table, and the following method actually has a prayer of finishing your actual use case -- a table A with thousands of rows and a table B with nearly a million rows -- in a reasonable amount of time.


With your toy example:

library(data.table)

A <- fread("
           loc
           150
           250
           400")

B <- fread("
           loc    value
           25     7
           77     19
           170    10
           320    15")

## Create a 'dummy' value to create an interval w/same start and end in A
A[,loc_Dummy := loc]

## Create values bounding the match range for loc in B
B[,loc_Plus100 := loc + 100]
B[,loc_Minus100 := loc - 100]

## Set up for the overlap join
setkey(A,loc,loc_Dummy)
setkey(B,loc_Minus100, loc_Plus100)

## Create a table of only matches instead of doing a full cartesian join of all cases
Matches <- foverlaps(A[,.(loc, loc_Dummy)],
                     B[,.(loc_Minus100,loc_Plus100,value)])

## Create a summary table
Matches[,.(n = .N, sum = sum(value), avg = mean(value)), by = .(loc)]

#    loc n sum  avg
# 1: 150 2  29 14.5
# 2: 250 2  25 12.5
# 3: 400 1  15 15.0

Scaling up - yikes!

However - this is actually an extremely computationally intensive problem. Scaling up to your actual case sizes shows the challenge here -- using 10,000 rows for table A and 1,000,000 rows for table B, this method completes in 91 seconds on the server I'm running on, but uses over 112 GB of memory!

A <- data.table(loc = sample.int(1000, size = 1e4, replace = TRUE))


B <- data.table(loc = sample.int(1000, size = 1e6, replace = TRUE),
                value = sample.int(100, size = 1e6, replace = TRUE))

system.time({
  A[,loc_Dummy := loc]
  B[,loc_Plus100 := loc + 100]
  B[,loc_Minus100 := loc - 100]

  setkey(A,loc,loc_Dummy)
  setkey(B,loc_Minus100, loc_Plus100)

  Matches <- foverlaps(A[,.(loc, loc_Dummy)],
                       B[,.(loc_Minus100,loc_Plus100,value)])

  Summary  <- Matches[,.(n = .N, sum = sum(value), avg = mean(value)), by = .(loc)]

})

## Warning: Memory usage peaks at ~112 GB!

# user  system elapsed 
# 56.407  46.258  91.135

This is barely within the capabilities of the server I use, and may well not be applicable in your case.

If you don't have hundreds of Gigabytes of memory at your disposal, you'll probably have to get a little more clever in the way you approach this and iterate through chunks at a time.

From what I can tell, your problem is actually similar in ways to the one posed (and solved) by Lorenzo Busetto and detailed in a blog post: Speeding up spatial analyses by integrating sf and data.table: a test case.


Chunking to the rescue

Requiring over ~100 Gigabytes of memory isn't really a feasible solution -- especially if you wanted to scale A or B up by an order of magnitude at some point.

However, a chunking method (inspired by Lorenzo's post linked above) that splits the problem into 100 chunks only increases the run-time by a trivial amount, to 116 seconds, but reduces peak memory usage to less than 3 GB! If I were planning on doing this in production, I'd go with something like the following.

One note: I didn't do any in-depth auditing of the accuracy of the results (I might have specified one of the range bounds as open when it should be closed, or vice versa), so I'd scrutinize the output against data you're familiar with before putting this into production.

A <- data.table(loc = sample.int(1000, size = 1e4, replace = TRUE))

B <- data.table(loc = sample.int(1000, size = 1e6, replace = TRUE),
                value = sample.int(100, size = 1e6, replace = TRUE))

system.time({

  A[,loc_Dummy := loc]
  B[,loc_Plus100 := loc + 100]
  B[,loc_Minus100 := loc - 100]

  setkey(A,loc)
  setkey(B,loc)

  ChunkCount <- 100
  ChunkSize <- A[,.N/ChunkCount]

  ResultList <- vector("list", ChunkCount) 

  for (j in seq_len(ChunkCount)){

    A_loc_Min <- A[((j-1)*ChunkSize + 1):(min(nrow(A),(j)*ChunkSize)), min(loc)]
    A_loc_Max <- A[((j-1)*ChunkSize + 1):(min(nrow(A),(j)*ChunkSize)), max(loc)]

    A_Sub <- A[loc >= A_loc_Min & loc < A_loc_Max]
    B_Sub <- B[loc_Plus100 >= A_loc_Min & loc_Minus100 < A_loc_Max]

    setkey(A_Sub,loc,loc_Dummy)
    setkey(B_Sub,loc_Minus100, loc_Plus100)

    Matches <- foverlaps(A_Sub[,.(loc, loc_Dummy)],
                         B_Sub[,.(loc_Minus100,loc_Plus100,value)])

    ResultList[[j]]  <- Matches[,.(n = .N, sum = sum(value), avg = mean(value)), by = .(loc)]

  }

  Summary  <- rbindlist(ResultList)

})

#    user  system elapsed 
# 109.125  16.864 116.129 

Validating

Update: @Alexis's and @Frank's suggestions in the comments generate the same result set; mine comes out slightly different, but only on the count. If someone else can verify that the correct answer is actually the one provided by @Alexis/@Frank, then I'd be happy to retract my answer, as both methods execute faster than the one I proposed.

set.seed(1234)

A <- data.table(loc = sample.int(1000, size = 1e3, replace = TRUE))

B <- data.table(loc = sample.int(1000, size = 1e4, replace = TRUE),
                value = sample.int(10, size = 1e4, replace = TRUE))



## Matt 
Matt_A <- copy(A)
Matt_B <- copy(B)

Matt_A[,loc_Dummy := loc]
Matt_B[,loc_Plus100 := loc + 100]
Matt_B[,loc_Minus100 := loc - 100]

setkey(Matt_A,loc,loc_Dummy)
setkey(Matt_B,loc_Minus100, loc_Plus100)

Matches <- foverlaps(Matt_A[,.(loc, loc_Dummy)],
                     Matt_B[,.(loc_Minus100,loc_Plus100,value)])

Summary_Matt  <- Matches[,.(n = .N, sum = sum(value), avg = mean(value)), keyby = .(loc)]


## Alexis

Rcpp::sourceCpp("RowRanges.cpp")

A_loc <- sort(A$loc, decreasing = FALSE)
B <- B[order(B$loc),]
Alexis <- foo(unique(A_loc), B$loc, B$value)

Summary_Alexis <- as.data.table(Alexis)
colnames(Summary_Alexis) <- c("n","sum","avg")

Summary_Alexis[,loc := unique(A_loc)]
setcolorder(Summary_Alexis, c("loc","n","sum","avg"))

## Frank

Frank <- A[, up := loc + 100][
  , dn := loc - 100][
    , c("n", "s", "m") := B[copy(.SD), on=.(loc >= dn, loc <= up), .(.N, sum(value), mean(value)), by=.EACHI][
      , .(N, V2, V3)]][]

Summary_Frank <- unique(Frank[,.(loc,n, sum = s, avg = m)][order(loc)])

## Comparing

all.equal(Summary_Frank,Summary_Alexis)
# [1] TRUE

all.equal(Summary_Frank,Summary_Matt)
# [1] "Column 'n': Mean relative difference: 1.425292"


Answer 5:

I don't normally suggest solutions that rely on installing packages, but I think this one will do the trick for you. It installs a package that lets you write SQL inside R.

# Install and load the package
install.packages("sqldf")
library(sqldf)

# Create tables
A <- data.frame("loc"=c(150,250,400))
B <- data.frame("loc"=c(25,77,170,320),"value"=c(7,19,10,15))


# Join tables
df0 <- sqldf('select a.loc
                    ,count(b.value) as n_value
                    ,sum(b.value) as sum_value
                    ,avg(b.value) as avg_value
              from A as a
              left join B as b
              on abs(a.loc - b.loc) <= 100
              group by a.loc')

# Print data frame
df0
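One detail worth noting (my own aside, not from the original answer): with the left join, any A row that has no B$loc within 100 gets n_value = 0 but a NULL (NA in R) sum_value and avg_value. If you prefer a zero sum, matching what sum() on an empty vector returns in base R, SQLite (sqldf's default backend) supports coalesce():

# Same query, but unmatched rows get sum_value = 0 instead of NA
df1 <- sqldf('select a.loc
                    ,count(b.value)            as n_value
                    ,coalesce(sum(b.value), 0) as sum_value
                    ,avg(b.value)              as avg_value
              from A as a
              left join B as b
              on abs(a.loc - b.loc) <= 100
              group by a.loc')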


Answer 6:

I'm not sure how well this solution will scale - it depends on whether the filter matrix fits in memory.

A <- within(A,{
 B.filter <- outer(B$loc, A$loc, function(x, y) abs(x - y) <= 100) 

 n <- colSums(B.filter)
 sum <- colSums(B$value * B.filter)
 avg <- sum / n
 rm(B.filter)
})
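To make "fits in memory" concrete, here is my own back-of-the-envelope estimate (not from the original answer) for the sizes mentioned in the question, using the fact that R stores logicals as 4 bytes each:

# Size of the logical filter matrix for ~1e6 rows in B by ~1e4 rows in A
n_B <- 1e6
n_A <- 1e4
(n_B * n_A * 4) / 2^30   # roughly 37 GB just for the filter matrix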

If locations in A and/or B repeat, you might be able to reduce the size of the filter matrix by only using unique values:

A <- within(A,{
 B.filter <- outer(unique(B$loc), unique(A$loc), function(x, y) abs(x - y) <= 100)
 colnames(B.filter) <- unique(A$loc)
 rownames(B.filter) <- unique(B$loc)
 # Expand back to one row per row of B and one column per row of A,
 # so the sums line up with B$value and with the rows of A
 B.filter <- B.filter[as.character(B$loc), as.character(A$loc), drop = FALSE]

 n <- colSums(B.filter)
 sum <- colSums(B$value * B.filter)
 avg <- sum / n
 rm(B.filter)
})