Python Pandas self join for merge cartesian produc

2019-04-11 15:32发布

站内文章 / Python

29 0

叛逆

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I am brand new to Python, seems like it has a lot of flexibility and is faster than traditional RDBMS systems.

Working on a very simple process to create random fantasy teams. I come from an RDBMS background (Oracle SQL) and that does not seem to be optimal for this data processing.

I made a dataframe using pandas read from csv file and now have a simple dataframe with two columns -- Player, Salary:

`                    Name  Salary
0              Jason Day   11700
1         Dustin Johnson   11600
2           Rory McIlroy   11400
3          Jordan Spieth   11100
4         Henrik Stenson   10500
5         Phil Mickelson   10200
6            Justin Rose    9800
7             Adam Scott    9600
8          Sergio Garcia    9400
9          Rickie Fowler    9200`

What I am trying to do via python (pandas) is produce all combinations of 6 players which salary is between a certain amount 45000 -- 50000.

In looking up python options, I found the itertools combination interesting, but it would result a massive list of combinations without filtering the sum of salary.

In traditional SQL, I would do a massive merge cartesian join w/ SUM, but then I get the players in different spots..

Such as A, B, C then, C, B, A..

My traditional SQL which doesn't work well enough is something like this:

` SELECT distinct
ONE.name AS "1", 
  TWO.name AS "2",
    THREE.name AS "3",
      FOUR.name AS "4", 
  FIVE.name AS "5", 
  SIX.name AS "6",
   sum(one.salary + two.salary + three.salary + four.salary + five.salary + six.salary) as salary
  FROM 
  nl.pgachamp2 ONE,nl.pgachamp2 TWO,nl.pgachamp2 THREE, nl.pgachamp2 FOUR,nl.pgachamp2 FIVE,nl.pgachamp2 SIX
 where ONE.name != TWO.name
 and ONE.name != THREE.name
 and one.name != four.name
 and one.name != five.name
 and TWO.name != THREE.name
 and TWO.name != four.name
 and two.name != five.name
 and TWO.name != six.name
 and THREE.name != four.name
 and THREE.name != five.name
 and three.name != six.name
 and five.name != six.name
 and four.name != six.name
 and four.name != five.name
 and one.name != six.name
 group by ONE.name, TWO.name, THREE.name, FOUR.name, FIVE.name, SIX.name`

Is there a way to do this in Pandas/Python?

Any documentation that can be pointed to would be great!

回答1:

Here's a non-Pandas solution using a simple algorithm. It generates combinations recursively from a list of players sorted by salary. This lets it skip generating combinations that exceed the salary cap.

As piRSquared mentions, there are no teams of 6 that fall within the salary limits stated in the question, so I chose limits to generate a small number of teams.

#!/usr/bin/env python3

''' Limited combinations

    Generate combinations of players whose combined salaries fall within given limits

    See http://stackoverflow.com/q/38636460/4014959

    Written by PM 2Ring 2016.07.28
'''

data = '''\
0              Jason Day   11700
1         Dustin Johnson   11600
2           Rory McIlroy   11400
3          Jordan Spieth   11100
4         Henrik Stenson   10500
5         Phil Mickelson   10200
6            Justin Rose    9800
7             Adam Scott    9600
8          Sergio Garcia    9400
9          Rickie Fowler    9200
'''

data = [s.split() for s in data.splitlines()]
all_players = [(' '.join(u[1:-1]), int(u[-1])) for u in data]
all_players.sort(key=lambda t: t[1])
for i, row in enumerate(all_players):
    print(i, row)
print('- '*40)

def choose_teams(free, num, team=(), value=0):
    num -= 1
    for i, p in enumerate(free):
        salary = all_players[p][1]
        newvalue = value + salary
        if newvalue <= hi:
            newteam = team + (p,)
            if num == 0:
                if newvalue >= lo:
                    yield newteam, newvalue
            else:
                yield from choose_teams(free[i+1:], num, newteam, newvalue)
        else:
            break

#Salary limits
lo, hi = 55000, 60500

#Indices of players that can be chosen for a team 
free = tuple(range(len(all_players)))

for i, (t, s) in enumerate(choose_teams(free, 6), 1):
    team = [all_players[p] for p in t]
    names, sals = zip(*team)
    assert sum(sals) == s
    print(i, t, names, s)

output

0 ('Rickie Fowler', 9200)
1 ('Sergio Garcia', 9400)
2 ('Adam Scott', 9600)
3 ('Justin Rose', 9800)
4 ('Phil Mickelson', 10200)
5 ('Henrik Stenson', 10500)
6 ('Jordan Spieth', 11100)
7 ('Rory McIlroy', 11400)
8 ('Dustin Johnson', 11600)
9 ('Jason Day', 11700)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
1 (0, 1, 2, 3, 4, 5) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Henrik Stenson') 58700
2 (0, 1, 2, 3, 4, 6) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Jordan Spieth') 59300
3 (0, 1, 2, 3, 4, 7) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Rory McIlroy') 59600
4 (0, 1, 2, 3, 4, 8) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Dustin Johnson') 59800
5 (0, 1, 2, 3, 4, 9) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Jason Day') 59900
6 (0, 1, 2, 3, 5, 6) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Henrik Stenson', 'Jordan Spieth') 59600
7 (0, 1, 2, 3, 5, 7) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Henrik Stenson', 'Rory McIlroy') 59900
8 (0, 1, 2, 3, 5, 8) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Henrik Stenson', 'Dustin Johnson') 60100
9 (0, 1, 2, 3, 5, 9) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Henrik Stenson', 'Jason Day') 60200
10 (0, 1, 2, 3, 6, 7) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Jordan Spieth', 'Rory McIlroy') 60500
11 (0, 1, 2, 4, 5, 6) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Phil Mickelson', 'Henrik Stenson', 'Jordan Spieth') 60000
12 (0, 1, 2, 4, 5, 7) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Phil Mickelson', 'Henrik Stenson', 'Rory McIlroy') 60300
13 (0, 1, 2, 4, 5, 8) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Phil Mickelson', 'Henrik Stenson', 'Dustin Johnson') 60500
14 (0, 1, 3, 4, 5, 6) ('Rickie Fowler', 'Sergio Garcia', 'Justin Rose', 'Phil Mickelson', 'Henrik Stenson', 'Jordan Spieth') 60200
15 (0, 1, 3, 4, 5, 7) ('Rickie Fowler', 'Sergio Garcia', 'Justin Rose', 'Phil Mickelson', 'Henrik Stenson', 'Rory McIlroy') 60500
16 (0, 2, 3, 4, 5, 6) ('Rickie Fowler', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Henrik Stenson', 'Jordan Spieth') 60400

If you're using an older version of Python that doesn't support the yield from syntax you can replace

yield from choose_teams(free[i+1:], num, newteam, newvalue)

with

for t, v in choose_teams(free[i+1:], num, newteam, newvalue):
    yield t, v

回答2:

As mentioned in the comments, this is a constraint satisfaction problem. It has a combinatorial part but since you defined no objectives to minimize or maximize, it is not an optimization problem (yet). You can approach this problems in many ways: you can try brute-force like piRSquared or use a heuristic algorithm like PM 2Ring. I will present a solution with 0-1 linear programming, and use PuLP library to model and solve the problem.

from pulp import *
import pandas as pd
df = df.set_index('Name')
feasible_solutions = []

In order to model the problem, first you need to define decision variables. Here, the decision variable will be an indicator variable for each player: it will be 1 if that player is selected, 0 otherwise. Here's how you do it in PuLP:

players = LpVariable.dicts('player', df.index.tolist(), 0, 1, LpInteger)

Next, you create a problem:

prob = pulp.LpProblem('Team Selection', pulp.LpMinimize)

As I mentioned earlier, your question does not state any objectives. You only want to create all possible teams. Therefore, we will define an arbitrary objective function (I will turn back to this arbitrary function again).

prob += 0

You mainly have two constraints:

1) The team will have 5 players:

prob += lpSum([players[player] for player in players]) == 5

Remember that players dictionary stores our decision variables. players[player] is either 1 (if that player is in the team) or 0 (otherwise). Therefore, if you sum all of them, the result should be equal to 5.

2) Total salary should be between 45k and 50k.

prob += lpSum([players[player] * df.at[player, 'Salary'] 
                                        for player in players]) <= 50000
prob += lpSum([players[player] * df.at[player, 'Salary'] 
                                        for player in players]) >= 45000

This is similar to the first constraint. Here, we are not counting but instead summing the salaries (when the player is in the team, the value will be 1 so it will be multiplied by the corresponding salary. Otherwise, the value will be zero and the multiplication will also be zero).

The main modeling is done here. If you call prob.solve(), it will find a solution that satisfies these constraints. Normally, in optimization problems, we provide an objective function and try to maximize or minimize that. For example, assume that you have scores for the skills of players. Your budget is limited, you cannot go ahead and select top 5 players. So, in the part where we stated prob += 0, you can define an objective function to maximize total skill score. But that wasn't what you were looking for so let's continue.

Once you find a solution, you can add another constraint to the problem stating that the next solution should be different than this, you can generate all solutions.

while prob.solve() == 1:
    current_solution = [player for player in players if value(players[player])]
    feasible_solutions.append(current_solution)
    prob += lpSum([players[player] for player in current_solution]) <= 4

prob.solve() is the method that solves the problem. Based on the result, it returns an integer. If it finds an optimal solution, the result is 1. For infeasible or unbounded solutions there are different codes. So as long as we can find new solutions, we continue the loop.

In the loop we first append the current solution to our feasible_solutions list. Then, we add another constraint: for these 5 players, the sum of the variables cannot exceed 4 (The largest value 5 and if it is 5, we know that this is the same solution).

If you run this, you will have the same result piRSquared got.

So, what is the advantage of this?

The main reason we use integer/binary linear programming is that number of combinations grows really quickly. This is called combinatorial explosion. Take a look at the number of possible teams (without any constraint):

from scipy.misc import comb
comb(10, 5)
Out: 252.0

comb(20, 5)
Out: 15504.0

comb(50, 5)
Out: 2118760.0

comb(100, 5)
Out: 75287520.0

It becomes almost impossible to evaluate all combinations.

Of course when you want to list all combinations that satisfies those constraints you are still running that risk. If the number of combinations satisfying the constraints is large, it will take a lot of time to compute. You cannot avoid that. However, if that subset is small or it is still large but you are evaluating a function on that set, it will be much better.

For example, consider the following DataFrame:

import numpy as np
np.random.seed(0)
df = pd.DataFrame({'Name': ['player' + str(i).zfill(3) for i in range(100)],
                   'Salary': np.random.randint(0, 9600, 100)})

268 of 75287520 solutions satisfy the salary constraint. It took 44 seconds in my computer to list them. It would take hours to find them using brute-force (update: it takes 8 hours and 21 minutes).

PuLP uses an open source solver, Cbc, by default. There are other open source/commercial alternative solvers that you can use with PuLP. Commercial ones are generally faster as expected (they are very expensive though).

Another alternative is constraint programming as I mentioned in the comments. For these kind of problems, you can find many smart ways to reduce the search space with constraint programming. I am comfortable with integer programming so I showed a model based on that but I should note that constraint programming might be way better for this.

回答3:

I ran this for combinations of 6 and found no teams that satisfied. I used 5 instead.

This should get you there:

from itertools import combinations
import pandas as pd


s = df.set_index('Name').squeeze()
combos = pd.DataFrame([c for c in combinations(s.index, 5)])
combo_salary = combos.apply(lambda x: s.ix[x].sum(), axis=1)
combos[(combo_salary >= 45000) & (combo_salary <= 50000)]