How to identify columns that are sums of other col

2019-07-21 09:35发布

问题:

I would like to write a function (preferably in R, but other languages are welcome), which would identify relationships between columns (limited to additions/substractions) in a dataset. A practical application of this would be to run it on large multi-column financial datasets, where some of the columns are subtotals of other columns - and identify such subtotals.

Ideally, I would like to allow for small discrepancies - e.g. to allow for rounding issues leading to columns not adding up exactly 100%.

I found the following question which includes a solution involving matrices and ranks, but I am not sure if there is any way to incorporate the ability to handle the noise in data arising from the rounding issues.

As an example:

d = data.frame(a=c(10.12, 20.02, 30.08, 20.19), b=c(12.12, 20.45, 20.52, 16.72), c=c(11, 123.25, 20.67, 20.78))
d$d = d$a + d$b
d$e = d$d + d$c
> d
      a     b      c     d      e
1 10.12 12.12  11.00 22.24  33.24
2 20.02 20.45 123.25 40.47 163.72
3 30.08 20.52  20.67 50.60  71.27
4 20.19 16.72  20.78 36.91  57.69

magic_function(d)
[1] "d$d = d$a + d$b"
[2] "d$e = d$d + d$c" # or "d$e = d$a + d$b + d$c" (first option preferred)

The solution in the linked question works well until I introduce noise into equation. e.g. d$d[[4]] = d$d[[4]] + 0.01 - then it no longer works at all. My question thus is:

  1. Are there any other methods to identifying relationships between columns (especially if they are restricted to simple addition/subtraction)
  2. Are any of the methods able to address the imperfect data quality issue or do I need to build some additional functionality for it (e.g. round the data before running it through the rank identification algorithm).

回答1:

Here is one idea which will work if you only need to check if any column is a result of the sum of any TWO others. It also allows you to add noise. We basically first create a data frame by adding all combinations of original data set. We then subtract each column of data set with created data frame. If all the values are 0 it means they match. By using colSums(i < 0.01) == nrow(i)), we are able to add the required noise.

d2 <- setNames(data.frame(combn(1:ncol(d), 2, function(i) rowSums(d[i]))), 
                combn(names(d), 2, function(j)paste(j, collapse = ' + ')))

l1 <- lapply(d, function(i) sapply(d2, function(j) Map(function(x, y)abs(x - y), i, j)))

lapply(l1, function(i) names(which(colSums(i < 0.01) == nrow(i))))

#$a
#character(0)

#$b
#character(0)

#$c
#character(0)

#$d
#[1] "a + b"

#$e
#[1] "c + d"

Or make it a function with noise as input argument,

f1 <- function(df, noise){
  d2 <- setNames(data.frame(combn(1:ncol(df), 2, function(i) rowSums(df[i]))), 
                 combn(names(df), 2, function(j)paste(j, collapse = ' + ')))
  l1 <- lapply(df, function(i) sapply(d2, function(j) 
                       Map(function(x, y)abs(x - y), i, j)))
  Filter(length, lapply(l1, function(i) 
                names(which(colSums(i < noise) == nrow(i)))))
}

f1(d, 0.01)
#$d
#[1] "a + b"

#$e
#[1] "c + d"

If we want to make it more flexible, then we can add another argument to take the combination number (of columns), i.e.

f1 <- function(df, n, noise){
  d2 <- setNames(data.frame(combn(1:ncol(df), n, function(i) rowSums(df[i]))), 
                 combn(names(df), n, function(j)paste(j, collapse = ' + ')))
  l1 <- lapply(df, function(i) sapply(d2, function(j) 
                       Map(function(x, y)abs(x - y), i, j)))
  Filter(length, lapply(l1, function(i) 
                names(which(colSums(i < noise) == nrow(i)))))
}

sapply(2:3, function(i) f1(d, i, 0.01))
#[[1]]
#[[1]]$d
#[1] "a + b"

#[[1]]$e
#[1] "c + d"

#[[2]]
#[[2]]$e
#[1] "a + b + c"


回答2:

If you allow the sums to only be for consecutive columns, and for previous values only, the computational effort for this is probably tractable for 10-20 columns. This procedure checks to see if the column is equal to the sum of previous consecutive columns, with some allowance for error:

d <- data.frame(a=c(10.12, 20.02, 30.08, 20.19),
                b=c(12.12, 20.45, 20.52, 16.72),
                c=c(11, 123.25, 20.67, 20.78));
d$d <- round(d$a + d$b + runif(4,0,0.04),2);
d$e <- round(d$d + d$c + runif(4,0,0.04),2);

## Assumptions:
## * sum columns relate to previous values only
## * sum columns relate to consecutive columns

sumColumns <- NULL;
allowedError <- 0.05;
for(col in 3:ncol(d)){
    for(subStart in 1:(col-2)){
        for(subEnd in (subStart+1):(col-1)){
            if(all(abs(d[,col] - rowSums(d[,subStart:subEnd, drop=FALSE])) <
                   allowedError)){
                cat(sprintf("Column %d is a sum of columns %d-%d\n",
                            col, subStart, subEnd));
                sumColumns[col] <- TRUE;
            }
        }
    }
}

Output:

Column 4 is a sum of columns 1-2
Column 5 is a sum of columns 3-4

This could be modified to allow for consecutive columns together with any number of sum columns while maintaining tractability (assuming the number of sum columns were kept low). That modification is not completely trivial, and is left as an exercise to the reader.