I am brand new to Python, seems like it has a lot of flexibility and is faster than traditional RDBMS systems.
Working on a very simple process to create random fantasy teams. I come from an RDBMS background (Oracle SQL) and that does not seem to be optimal for this data processing.
I made a dataframe using pandas read from csv file and now have a simple dataframe with two columns -- Player, Salary:
` Name Salary
0 Jason Day 11700
1 Dustin Johnson 11600
2 Rory McIlroy 11400
3 Jordan Spieth 11100
4 Henrik Stenson 10500
5 Phil Mickelson 10200
6 Justin Rose 9800
7 Adam Scott 9600
8 Sergio Garcia 9400
9 Rickie Fowler 9200`
What I am trying to do via python (pandas) is produce all combinations of 6 players which salary is between a certain amount 45000 -- 50000.
In looking up python options, I found the itertools combination interesting, but it would result a massive list of combinations without filtering the sum of salary.
In traditional SQL, I would do a massive merge cartesian join w/ SUM, but then I get the players in different spots..
Such as A, B, C then, C, B, A..
My traditional SQL which doesn't work well enough is something like this:
` SELECT distinct
ONE.name AS "1",
TWO.name AS "2",
THREE.name AS "3",
FOUR.name AS "4",
FIVE.name AS "5",
SIX.name AS "6",
sum(one.salary + two.salary + three.salary + four.salary + five.salary + six.salary) as salary
FROM
nl.pgachamp2 ONE,nl.pgachamp2 TWO,nl.pgachamp2 THREE, nl.pgachamp2 FOUR,nl.pgachamp2 FIVE,nl.pgachamp2 SIX
where ONE.name != TWO.name
and ONE.name != THREE.name
and one.name != four.name
and one.name != five.name
and TWO.name != THREE.name
and TWO.name != four.name
and two.name != five.name
and TWO.name != six.name
and THREE.name != four.name
and THREE.name != five.name
and three.name != six.name
and five.name != six.name
and four.name != six.name
and four.name != five.name
and one.name != six.name
group by ONE.name, TWO.name, THREE.name, FOUR.name, FIVE.name, SIX.name`
Is there a way to do this in Pandas/Python?
Any documentation that can be pointed to would be great!
Here's a non-Pandas solution using a simple algorithm. It generates combinations recursively from a list of players sorted by salary. This lets it skip generating combinations that exceed the salary cap.
As piRSquared mentions, there are no teams of 6 that fall within the salary limits stated in the question, so I chose limits to generate a small number of teams.
output
If you're using an older version of Python that doesn't support the
yield from
syntax you can replacewith
I ran this for combinations of 6 and found no teams that satisfied. I used 5 instead.
This should get you there:
As mentioned in the comments, this is a constraint satisfaction problem. It has a combinatorial part but since you defined no objectives to minimize or maximize, it is not an optimization problem (yet). You can approach this problems in many ways: you can try brute-force like piRSquared or use a heuristic algorithm like PM 2Ring. I will present a solution with 0-1 linear programming, and use PuLP library to model and solve the problem.
In order to model the problem, first you need to define decision variables. Here, the decision variable will be an indicator variable for each player: it will be 1 if that player is selected, 0 otherwise. Here's how you do it in PuLP:
Next, you create a problem:
As I mentioned earlier, your question does not state any objectives. You only want to create all possible teams. Therefore, we will define an arbitrary objective function (I will turn back to this arbitrary function again).
You mainly have two constraints:
1) The team will have 5 players:
Remember that players dictionary stores our decision variables.
players[player]
is either 1 (if that player is in the team) or 0 (otherwise). Therefore, if you sum all of them, the result should be equal to 5.2) Total salary should be between 45k and 50k.
This is similar to the first constraint. Here, we are not counting but instead summing the salaries (when the player is in the team, the value will be 1 so it will be multiplied by the corresponding salary. Otherwise, the value will be zero and the multiplication will also be zero).
The main modeling is done here. If you call
prob.solve()
, it will find a solution that satisfies these constraints. Normally, in optimization problems, we provide an objective function and try to maximize or minimize that. For example, assume that you have scores for the skills of players. Your budget is limited, you cannot go ahead and select top 5 players. So, in the part where we statedprob += 0
, you can define an objective function to maximize total skill score. But that wasn't what you were looking for so let's continue.Once you find a solution, you can add another constraint to the problem stating that the next solution should be different than this, you can generate all solutions.
prob.solve()
is the method that solves the problem. Based on the result, it returns an integer. If it finds an optimal solution, the result is 1. For infeasible or unbounded solutions there are different codes. So as long as we can find new solutions, we continue the loop.In the loop we first append the current solution to our
feasible_solutions
list. Then, we add another constraint: for these 5 players, the sum of the variables cannot exceed 4 (The largest value 5 and if it is 5, we know that this is the same solution).If you run this, you will have the same result piRSquared got.
So, what is the advantage of this?
The main reason we use integer/binary linear programming is that number of combinations grows really quickly. This is called combinatorial explosion. Take a look at the number of possible teams (without any constraint):
It becomes almost impossible to evaluate all combinations.
Of course when you want to list all combinations that satisfies those constraints you are still running that risk. If the number of combinations satisfying the constraints is large, it will take a lot of time to compute. You cannot avoid that. However, if that subset is small or it is still large but you are evaluating a function on that set, it will be much better.
For example, consider the following DataFrame:
268 of 75287520 solutions satisfy the salary constraint. It took 44 seconds in my computer to list them. It would take hours to find them using brute-force (update: it takes 8 hours and 21 minutes).
PuLP uses an open source solver, Cbc, by default. There are other open source/commercial alternative solvers that you can use with PuLP. Commercial ones are generally faster as expected (they are very expensive though).
Another alternative is constraint programming as I mentioned in the comments. For these kind of problems, you can find many smart ways to reduce the search space with constraint programming. I am comfortable with integer programming so I showed a model based on that but I should note that constraint programming might be way better for this.