I've been tasked with helping some accountants solve a common problem they have - given a list of transactions and a total deposit, which transactions are part of the deposit? For example, say I have this list of numbers:
1.00
2.50
3.75
8.00
And I know that my total deposit is 10.50
, I can easily see that it's made up of the 8.00
and 2.50
transaction. However, given a hundred transactions and a deposit in the millions, it quickly becomes much more difficult.
In testing a brute force solution (which takes way too long to be practical), I had two questions:
With a list of about 60 numbers, it seems to find a dozen or more combinations for any total that's reasonable. I was expecting a single combination to satisfy my total, or maybe a few possibilities, but there always seem to be a ton of combinations. Is there a math principle that describes why this is? It seems that given a collection of random numbers of even a medium size, you can find a multiple combination that adds up to just about any total you want.
I built a brute force solution for the problem, but it's clearly O(n!), and quickly grows out of control. Aside from the obvious shortcuts (exclude numbers larger than the total themselves), is there a way to shorten the time to calculate this?
Details on my current (super-slow) solution:
The list of detail amounts is sorted largest to smallest, and then the following process runs recursively:
- Take the next item in the list and see if adding it to your running total makes your total match the target. If it does, set aside the current chain as a match. If it falls short of your target, add it to your running total, remove it from the list of detail amounts, and then call this process again
This way it excludes the larger numbers quickly, cutting the list down to only the numbers it needs to consider. However, it's still n! and larger lists never seem to finish, so I'm interested in any shortcuts I might be able to take to speed this up - I suspect that even cutting 1 number out of the list would cut the calculation time in half.
Thanks for your help!
C# version
setup test:
code:
results:
If subTotals are repeated, there will appear to be duplicate results (the desired effect). In reality, you will probably want to use the subTotal Tupled with some ID, so you can relate it back to your data.
There is a cheap Excel Add-in that solves this problem: SumMatch
If I understand your problem correctly, you have a set of transactions, and you merely wish to know which of them could have been included in a given total. So if there are 4 possible transactions, then there are 2^4 = 16 possible sets to inspect. This problem is, for 100 possible transactions, the search space has 2^100 = 1267650600228229401496703205376 possible combinations to search over. For 1000 potential transactions in the mix, it grows to a total of
10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376
sets that you must test. Brute force will hardly be a viable solution on these problems.
Instead, use a solver that can handle knapsack problems. But even then, I'm not sure that you can generate a complete enumeration of all possible solutions without some variation of brute force.
The Excel Solver Addin as posted over on superuser.com has a great solution (if you have Excel) https://superuser.com/questions/204925/excel-find-a-subset-of-numbers-that-add-to-a-given-total
This special case of the Knapsack problem is called Subset Sum.
Its kind of like 0-1 Knapsack problem which is NP-complete and can be solved through dynamic programming in polynomial time.
http://en.wikipedia.org/wiki/Knapsack_problem
But at the end of the algorithm you also need to check that the sum is what you wanted.