I have a simple question. I would like to get the probability distribution of the sum of two non-parametric distributions.
Here is an example. There are two cities, each with 10 houses, and we know the energy consumption of each house. I want the probability distribution of the sum of the consumption of a random house chosen from each city.
A1 <- c(1,2,3,3,3,4,4,5,6,7) #10 houses' energy consumption for city A
B1 <- c(11,13,15,17,17,18,18,19,20,22) #10 houses' energy consumption for city B
I have a probability distribution for A1 and for B1; how can I get the probability distribution of A1+B1?
If I just use A1+B1 in R, it gives 12 15 18 20 20 22 22 24 26 29. However, I don't think this is right, because there is no inherent order to the houses. When I change the order of the houses, I get different results.
# Original
A1 <- c(1,2,3,3,3,4,4,5,6,7)
B1 <- c(11,13,15,17,17,18,18,19,20,22)
#change order 1
A2 <- c(7,6,5,4,4,3,3,3,2,1)
B2 <- c(22,20,19,18,18,17,17,15,13,11)
#change order 2
A3 <- c(3,3,3,4,4,5,6,7,1,2)
B3 <- c(17,17,18,18,19,13,20,11,22,15)
sum1 <- A1+B1; sum1
sum2 <- A1+B2; sum2
sum3 <- A3+B3; sum3
The red lines in my plots are sum1, sum2, and sum3. I am not sure how I can get the distribution of the sum of two distributions. Please give me any ideas. Thanks!
(If those distributions were normal or uniform, I could get the distribution of the sum easily, but these are not normal and the houses have no order.)
Edit:
Now that I better understand the question, and have seen @jeremycg's answer, I have a different approach that I think will scale better with sample size.
Rather than relying on the values in A1 and B1 being the only values in the distributions, we can treat them as samples from underlying distributions. To avoid imposing a particular form on those distributions, I'll use an empirical 'equivalent': the sample density. Using the density function, we can infer the relative probabilities of sampling a continuous range of household energy uses in either town. We can then randomly draw an arbitrary number of energies (with replacement) from the density()$x values, with the samples weighted by prob = density()$y, i.e., peaks in the density plot are at x-values that should be resampled more often.
As a rough heuristic, an oversimplified statement would be that mean(A1) is 3.8 and mean(B1) is 17, so the sum of energy uses from the two cities should be, on average, about 20.8. Using this as the "does it make sense" test, I think the approach sketched below gives the type of result you want. Note that I bounded the density at 0, because I'm assuming you don't want to infer negative energies. The peak of the resulting density is just above 20, so it makes sense.
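A minimal sketch of that resampling idea; the sample_sum function referred to below might look something like this (the n = 1e5 draws and the from = 0 lower bound are illustrative choices):

sample_sum <- function(a, b, n = 1e5) {
  # estimate each city's energy-use density, bounded below at 0
  da <- density(a, from = 0)
  db <- density(b, from = 0)
  # resample energies from each density, weighted by the density heights
  sa <- sample(da$x, size = n, replace = TRUE, prob = da$y)
  sb <- sample(db$x, size = n, replace = TRUE, prob = db$y)
  sa + sb  # one summed energy per simulated pair of houses
}

sums <- sample_sum(A1, B1)
plot(density(sums, from = 0))  # peak sits near mean(A1) + mean(B1), i.e. about 20.8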
The potential advantage here is that you don't need to look at every possible combination of energies from the houses in the two cities to understand the distribution of summed energy uses. If you can define the distribution of both, you can define the distribution of paired sums.
Finally, the computation time is trivial, especially compared to the approach of finding all combinations. For example, with 10 million houses in each city, the expand.grid approach gives me Error: cannot allocate vector of size 372529.0 Gb, whereas the sample_sum approach takes about 0.12 seconds. Of course, if the answer doesn't help you, the speed is worthless ;)
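A sketch of how that comparison could be reproduced (the 10-million-house vectors are simulated here by resampling the original data; exact timings will vary by machine):

bigA <- sample(A1, 1e7, replace = TRUE)  # simulate 10 million houses per city
bigB <- sample(B1, 1e7, replace = TRUE)
# expand.grid(A = bigA, B = bigB)        # fails: the full grid of pairings is far too large to allocate
system.time(sample_sum(bigA, bigB))      # completes quickly (about 0.12 s in the timing quoted above)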
In theory, the distribution of the sum of two independent random variables is the convolution of their PDFs:

f_{A+B}(s) = ∫ f_A(x) f_B(s − x) dx

So, I think this case can be computed by convolution.
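For these two 10-house samples, the empirical analogue of that convolution is a small discrete computation. A minimal sketch, assuming each house within a city is equally likely to be picked:

pa <- table(A1) / length(A1)  # empirical PMF of city A
pb <- table(B1) / length(B1)  # empirical PMF of city B
sums  <- outer(as.numeric(names(pa)), as.numeric(names(pb)), "+")  # every possible summed value
probs <- outer(as.numeric(pa), as.numeric(pb), "*")                # joint probabilities under independence
pmf_sum <- tapply(probs, sums, sum)  # P(A + B = s) for each distinct sum s
pmf_sum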
You probably want something like the sketch below. Using expand.grid will get you a data frame of all combinations of A1 and B1, and rowSums will add them.
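A minimal sketch of that suggestion (the column names are just illustrative):

combos <- expand.grid(A = A1, B = B1)  # all 100 pairings of a house in A with a house in B
sums   <- rowSums(combos)              # summed energy use for each pairing
table(sums) / length(sums)             # empirical probability distribution of the sum
hist(sums)                             # or visualize it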
Is it not the case that sorting the distributions prior to adding solves this problem?