Python Pandas Create New Bin/Bucket Variable with

Posted 2019-03-15 18:13

Question:

How do you create a new bin/bucket variable using pd.qcut in Python?

This might seem elementary to experienced users, but I was not clear on it, and it was surprisingly hard to search for on Stack Overflow and Google. Some thorough searching turned up this question (Assignment of qcut as new column), but it didn't quite answer mine because it stopped short of the last step: assigning each observation a bin number (i.e. 1, 2, ...).

Answer 1:

In Pandas 0.15.0 or newer, pd.qcut returns a Series, not a Categorical, when the input is a Series (as it is in your case). If you also set labels=False, the returned Series holds the integer bin indicators as its values.

So to future-proof your code, you could use

data3['bins_spd'] = pd.qcut(data3['spd_pct'], 5, labels=False)
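As a minimal sketch of the line above, the following builds a small stand-in for data3 (the ten spd_pct values are made up for illustration, not the asker's data) and bins it into quintiles. Because labels=False numbers the bins from 0, the sketch also adds 1 to get the 1, 2, ... numbering mentioned in the question; the column name bins_spd_1based is hypothetical.

```python
import pandas as pd

# Hypothetical spread percentages standing in for data3['spd_pct'].
data3 = pd.DataFrame({"spd_pct": [0.05, 0.10, 0.15, 0.20, 0.25,
                                  0.30, 0.35, 0.40, 0.45, 0.50]})

# labels=False returns the zero-based quantile bin index for each row.
data3["bins_spd"] = pd.qcut(data3["spd_pct"], 5, labels=False)

# Shift by one if bins numbered 1..5 are preferred (hypothetical column name).
data3["bins_spd_1based"] = data3["bins_spd"] + 1

print(data3)
```

With ten evenly spaced values and five quantiles, each bin should receive two observations.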

Alternatively, pass a NumPy array to pd.qcut so that you get a Categorical back. Note that the Categorical attribute labels is deprecated; use codes instead:

data3['bins_spd'] = pd.qcut(data3['spd_pct'].values, 5).codes
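To check that the two approaches agree, here is a small sketch on a made-up NumPy array (the values are illustrative only). It compares the codes of the Categorical returned for array input against the result of labels=False on the same array.

```python
import numpy as np
import pandas as pd

# Illustrative values only; any 1-D numeric array would do.
values = np.array([0.05, 0.10, 0.15, 0.20, 0.25,
                   0.30, 0.35, 0.40, 0.45, 0.50])

cat = pd.qcut(values, 5)      # ndarray in, Categorical out
codes_from_cat = cat.codes    # integer bin index per element

# Same binning, asking qcut for the bin indices directly.
codes_direct = pd.qcut(values, 5, labels=False)

same = bool((codes_from_cat == codes_direct).all())
print(type(cat).__name__, same)
```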


Answer 2:

EDIT: The answer below is only valid for versions of Pandas older than 0.15.0. If you are running Pandas 0.15.0 or newer, use:

data3['bins_spd'] = pd.qcut(data3['spd_pct'], 5, labels=False)

Thanks to @unutbu for pointing it out. :)

Say you have some data that you want to bin (in my case, options spreads) and you want to make a new variable holding the bucket that corresponds to each observation. The link mentioned above shows that you can do this with:

print(pd.qcut(data3['spd_pct'], 40))

(0.087, 0.146]
(0.0548, 0.087]
(0.146, 0.5]
(0.146, 0.5]
(0.087, 0.146]
(0.0548, 0.087]
(0.5, 2]

which gives you the bin endpoints that correspond to each observation. However, if you would like the corresponding bin number for each observation instead, you can do this:

print(pd.qcut(data3['spd_pct'], 5).labels)

[2 1 3 ..., 0 1 4] 
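For readers on a newer Pandas, where a Series input yields a Series of category dtype and the labels attribute is not available, the bin numbers can be read through the .cat accessor. This is a sketch on made-up values, assuming a reasonably recent Pandas in which the categories are interval objects.

```python
import pandas as pd

# Illustrative values standing in for data3['spd_pct'].
spd_pct = pd.Series([0.05, 0.10, 0.15, 0.20, 0.25,
                     0.30, 0.35, 0.40, 0.45, 0.50])

binned = pd.qcut(spd_pct, 5)         # Series of category dtype holding the bins
bin_numbers = binned.cat.codes       # zero-based bin index per observation
bin_edges = binned.cat.categories    # the five bins themselves

print(bin_numbers.tolist())
print(bin_edges)
```

Keeping both bin_numbers and bin_edges around makes it easy to map a bin number back to the range it covers.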

Putting it all together, if you would like to create a new variable holding just the bin numbers, this should suffice:

data3['bins_spd']=pd.qcut(data3['spd_pct'],5).labels

print(data3.head())

   secid      date    symbol  symbol_flag     exdate   last_date cp_flag  
0   5005  1/2/1997  099F2.37            0  1/18/1997         NaN       P   
1   5005  1/2/1997  09B0B.1B            0  2/22/1997   12/3/1996       P   
2   5005  1/2/1997  09B7C.2F            0  2/22/1997  12/11/1996       P   
3   5005  1/2/1997  09EE6.6E            0  1/18/1997  12/27/1996       C   
4   5005  1/2/1997  09F2F.CE            0  8/16/1997         NaN       P   

   strike_price  best_bid  best_offer     ...      close  volume_y    return  
0          7500     2.875      3.2500     ...        4.5     99200  0.074627   
1         10000     5.375      5.7500     ...        4.5     99200  0.074627   
2          5000     0.625      0.8750     ...        4.5     99200  0.074627   
3          5000     0.125      0.1875     ...        4.5     99200  0.074627   
4          7500     3.000      3.3750     ...        4.5     99200  0.074627   

   cfadj_y  open  cfret  shrout      mid   spd_pct  bins_spd  
0        1   4.5      1   57735  3.06250  0.122449         2  
1        1   4.5      1   57735  5.56250  0.067416         1  
2        1   4.5      1   57735  0.75000  0.333333         3  
3        1   4.5      1   57735  0.15625  0.400000         3  
4        1   4.5      1   57735  3.18750  0.117647         2  

[5 rows x 35 columns]

Hope this helps somebody else. At the very least it should be easier to search for now. :)