R: How to get a sum of two distributions?

2019-01-19 18:12发布

I have a simple question. I would like to sum of two non-parametric distributions.

Here is an example. There are two cities which have 10 houses. we know energy consumption for each house. (edited) I want to get the probability distribution of the sum of a random house chosen from each city.

A1 <- c(1,2,3,3,3,4,4,5,6,7) #10 houses' energy consumption for city A
B1 <- c(11,13,15,17,17,18,18,19,20,22) #10 houses' energy consumption for city B

I have a probability distribution of A1 and B1, how can I get the probability distribution of A1+B1? If I just use A1+B1 in R, it gives 12 15 18 20 20 22 22 24 26 29. However, I don't think this is right. Becuase there is not order in houses.

When I change the order of houses, it gives another results.

# Original
A1 <- c(1,2,3,3,3,4,4,5,6,7)
B1 <- c(11,13,15,17,17,18,18,19,20,22)
#change order 1
A2 <- c(7,6,5,4,4,3,3,3,2,1) 
B2 <- c(22,20,19,18,18,17,17,15,13,11)
#change order 2
A3 <- c(3,3,3,4,4,5,6,7,1,2) 
B3 <- c(17,17,18,18,19,13,20,11,22,15)
sum1 <- A1+B1; sum1
sum2 <- A1+B2; sum2
sum3 <- A3+B3; sum3

enter image description here

Red lines are sum1, sum2, and sum3. I am not sure how can I get the distribution of the sum of two distributions.Please give me any ideas.Thanks!

(If those distributions are normal or uniform distributions, I could get the sum of distribution easily, but these are not a normal and there is no order)

4条回答
姐就是有狂的资本
2楼-- · 2019-01-19 18:55

Edit:

Now that I better understand the question, and see @jeremycg 's answer, I think I have a different approach that I think will scale better with sample size.

Rather than relying on the values in A1 and B1 being the only values in the distribution, we could infer that those are simply samples from a distribution. To avoid imposing a particular form on the distribution, I'll use an empirical 'equivalent': the sample density. If we use the density function, we can infer the relative probabilities of sampling a continuous range of household energy uses from either town. We can randomly draw an arbitrary number of energies (with replacement), from the density()$x values, where the sample's we take are weighted with prob=density()$y ... i.e., peaks in the density plot are at x-values that should be resample more often.

As a heuristic, an oversimplified statement could say that mean(A1) is 3.8, and mean(B1) is 17, so the sum of energy uses from the two cities should be, on average, ~20.8. Using this as the "does it make sense test"/ heuristic, I think the following approach aligns with the type of result you want.

sample_sum <- function(A, B, n, ...){
    qss <- function(X, n, ...){
        r_X <- range(X)
        dens_X <- density(X, ...)
        sample(dens_X$x, size=n, prob=dens_X$y, replace=TRUE)
    }

    sample_A <- qss(A, n=n, ...)
    sample_B <- qss(B, n=n, ...)

    sample_A + sample_B
}

ss <- sample_sum(A1, B1, n=100, from=0)

png("~/Desktop/answer.png", width=5, height=5, units="in", res=150)
plot(density(ss))
dev.off()

Note that I bounded the density plot at 0, because I'm assuming you don't want to infer negative energies. I see that the peak in the resultant density is just above 20, so 'it makes sense'.

The potential advantage here is that you don't need to look at every possible combination of energies from the houses in the two cities to understand the distribution of summed energy uses. If you can define the distribution of both, you can define the distribution of paired sums.

Finally, the computation time is trivial, especially compared the approach finding all combinations. E.g., with 10 million houses in each city, if I try to do the expand.grid approach I get a Error: cannot allocate vector of size 372529.0 Gb error, whereas the sample_sum approach takes 0.12 seconds.

Of course, if the answer doesn't help you, the speed is worthless ;)

enter image description here

查看更多
Evening l夕情丶
3楼-- · 2019-01-19 18:59

In theory, the sum distribution of two random variables is the convolution of their PDFs, details, as:

PDF(Z) = PDF(Y) * PDF(X)

So, I think this case can be computed by convolution.

# your data
A1 <- c(1,2,3,3,3,4,4,5,6,7) #10 houses' energy consumption for city A
B1 <- c(11,13,15,17,17,18,18,19,20,22) #10 houses' energy consumption for city B

# compute PDF/CDF
PDF_A1 <- table(A1)/length(A1)
CDF_A1 <- cumsum(PDF_A1)

PDF_B1 <- table(B1)/length(B1)
CDF_B1 <- cumsum(PDF_B1)

# compute the sum distribution 
PDF_C1 <- convolve(PDF_B1, PDF_A1, type = "open")

# plotting
plot(PDF_C1, type="l", axe=F, main="PDF of A1+B1")
box()
axis(2)
# FIXME: is my understand for X correct?
axis(1, at=seq(1:14), labels=(c(names(PDF_A1)[-1],names(PDF_B1))))

enter image description here

Note:

CDF: cumulative distribution function

PDF: probability density function

## To make the x-values correspond to actually sums, consider
## compute PDF
## pad zeros in probability vectors to convolve
r <- range(c(A1, B1))
pdfA <- pdfB <- vector('numeric', diff(r)+1L)
PDF_A1 <- table(A1)/length(A1)                        # same as what you have done
PDF_B1 <- table(B1)/length(B1)
pdfA[as.numeric(names(PDF_A1))] <- as.vector(PDF_A1)  # fill the values
pdfB[as.numeric(names(PDF_B1))] <- as.vector(PDF_B1)

## compute the convolution and plot
res <- convolve(pdfA, rev(pdfB), type = "open")
plot(res, type="h", xlab='Sum', ylab='')

enter image description here

## In this simple case (with discrete distribution) you can compare
## to previous solution
tst <- rowSums(expand.grid(A1, B1))
plot(table(tst) / sum(as.vector(table(tst))), type='h')

enter image description here

查看更多
干净又极端
4楼-- · 2019-01-19 19:12

You probably want something like:

rowSums(expand.grid(A1, B1))

Using expand.grid will get you a dataframe of all combinations of A1 and B1, and rowSums will add them.

查看更多
叼着烟拽天下
5楼-- · 2019-01-19 19:12

Is it not the case that sorting the distribution prior to adding solves this problem?

A1 <- c(1,2,3,3,3,4,4,5,6,7) #10 houses' energy consumption for city A
B1 <- c(11,13,15,17,17,18,18,19,20,22) #10 houses' energy consumption for city B
sort(A1)+sort(B1)
查看更多
登录 后发表回答