Multiple Indexes for Dataframe Grouping

2019-07-15 07:09发布

I'll just start with the example and then break down what is happening.

This is a sample input:

DataFrame:

**Name**    **No.**      **Test**       ***Grade***
Bob        2123320        Math             Nan
Joe        2832883       English           90
John       2139300       Science           85
Bob        2123320        History          93
John       2234903        Math             99

Hopeful output:

**Name**         ********2139300*********     ********2234903*******
                  Math   English  Science     Math   English  Science 
  John            0       0         85        99        0          0

Like the title suggests, I am trying to apply multiple indexes. So basically it starts by looking for each name, and then for each name it finds to see how many distinct No.'s it has. In this case it sets a threshold at at least 2 distinct No.'s (which is why only John is outputted and Joe/Bob are not).

Now in each of these distinct No's. I have a specific subset of Tests I want to search for, in this case only {Math, English, Science}. For each of these tests, if the person in question took it in that No., there should be a grade. I would like that grade to be outputted for the test in question and for the tests not taken by that person on that No. I would like it to output some sort of simple marker (i.e if the person only took Math on that day, for English and Science output 0).

So in effect, it first indexes people by the number of distinct No.'s and groups them as such. It then indexes them by type of Test (for which I only want a subset). It finally assigns each person a value for the type of test they took and for the ones they didn't simply outputs an 0.

It's similar to another problem I asked earlier: Grouped Feature Matrix in Python #2- Follow Up

Except now instead of 1's and 0's I have another column with actual values that I would like to output.

Thank you.

EDIT: More sample/Output

 **Name**    **No.**      **Test**       ***Grade***
Bob        2123320        Math             Nan
Joe        2832883       English           90
John       2139300       Science           85
Bob        2123320        History          93
John       2234903        Math             99
Bob        2932848         English         99


  **Name**    2139300        2234903       2123320      2932848
          M   E    S      M   E    S    M   E    S    M   E    S
  John    0   0    85    99   0    0   Nan  Nan  Nan  Nan  Nan Nan
  Bob     Nan Nan  Nan   Nan  nan  Nan 86   0    0    0    99  0

2条回答
Animai°情兽
2楼-- · 2019-07-15 07:19

Let's use:

Filter the dataframe to only those records you are concerned with

df_out = df[df.groupby(['Name'])['No.'].transform(lambda x: x.nunique() > 1)]

Now, reshape dataframe with set_index, unstack, and reindex:

df_out.set_index(['Name','No.','Test'])['Grade'].sum(level=[0,1,2])\
      .unstack(-1, fill_value=0)\
      .reindex(['Math','English','Science'], axis=1, fill_value=0)\
      .unstack(-1, fill_value=0).swaplevel(0, 1, axis=1)\
      .sort_index(1)

Output:

No.  2123320              2139300              2234903              2932848             
Test English Math Science English Math Science English Math Science English Math Science
Name                                                                                    
Bob        0    0       0       0    0       0       0    0       0      99    0       0
John       0    0       0       0    0      85       0   99       0       0    0       0
查看更多
别忘想泡老子
3楼-- · 2019-07-15 07:38

You can use pivot_table:

In [11]: df.pivot_table(values="Grade", index=["Name"], columns=["No.", "Test"])
Out[11]:
No.  2123320 2139300 2234903 2832883
Test History Science    Math English
Name
Bob     93.0     NaN     NaN     NaN
Joe      NaN     NaN     NaN    90.0
John     NaN    85.0    99.0     NaN

With the dropna flag to include all the NaN columns:

In [12]: df.pivot_table(values="Grade", index=["Name"], columns=["No.", "Test"], dropna=False)
Out[12]:
No.  2123320                      2139300                      2234903                       2832883
Test English History Math Science English History Math Science English History  Math Science English History Math Science
Name
Bob      NaN    93.0  NaN     NaN     NaN     NaN  NaN     NaN     NaN     NaN   NaN     NaN     NaN     NaN  NaN     NaN
Joe      NaN     NaN  NaN     NaN     NaN     NaN  NaN     NaN     NaN     NaN   NaN     NaN    90.0     NaN  NaN     NaN
John     NaN     NaN  NaN     NaN     NaN     NaN  NaN    85.0     NaN     NaN  99.0     NaN     NaN     NaN  NaN     NaN

and with fill_value=0

In [13]: df.pivot_table(values="Grade", index=["Name"], columns=["No.", "Test"], dropna=False, fill_value=0)
Out[13]:
No.  2123320                      2139300                      2234903                      2832883
Test English History Math Science English History Math Science English History Math Science English History Math Science
Name
Bob        0      93    0       0       0       0    0       0       0       0    0       0       0       0    0       0
Joe        0       0    0       0       0       0    0       0       0       0    0       0      90       0    0       0
John       0       0    0       0       0       0    0      85       0       0   99       0       0       0    0       0
查看更多
登录 后发表回答