How do you create a new bin/bucket variable using pd.qcut in Python?
This might seem elementary to experienced users, but I was not clear on it, and it was surprisingly hard to search for on Stack Overflow or Google. Some thorough searching turned up "Assignment of qcut as new column", but it didn't quite answer my question because it didn't take the last step and put everything into numbered bins (i.e. 1, 2, ...).
In Pandas 0.15.0 or newer, pd.qcut will return a Series, not a Categorical, if the input is a Series (as it is, in your case) or if labels=False. If you set labels=False, then qcut will return a Series with the integer indicators of the bins as values.
So to future-proof your code, you could use
data3['bins_spd'] = pd.qcut(data3['spd_pct'], 5, labels=False)
or pass a NumPy array to pd.qcut so you get a Categorical as the return value. Note that the Categorical attribute labels is deprecated. Use codes instead:
data3['bins_spd'] = pd.qcut(data3['spd_pct'].values, 5).codes
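As a minimal sketch of the two options above, here they are side by side on made-up data. The frame data3 below is a small synthetic stand-in, not the original options data, and the variable names bins_from_labels and bins_from_codes are only illustrative. Assuming a reasonably recent pandas, both routes should give the same integer bin codes:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the original data; the values are invented for illustration.
data3 = pd.DataFrame(
    {"spd_pct": [0.05, 0.07, 0.09, 0.12, 0.15, 0.20, 0.33, 0.40, 0.60, 1.50]}
)

# Option 1: labels=False returns the integer bin indicators directly.
bins_from_labels = pd.qcut(data3["spd_pct"], 5, labels=False)

# Option 2: a NumPy array input yields a Categorical, whose .codes are the bin numbers.
bins_from_codes = pd.qcut(data3["spd_pct"].values, 5).codes

data3["bins_spd"] = bins_from_labels

# Both routes assign each observation to the same quintile (0 through 4).
assert np.array_equal(bins_from_labels.to_numpy(), bins_from_codes)
print(data3)
```

Option 1 is probably the simpler choice when the input is already a DataFrame column, since it skips the intermediate Categorical.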
EDIT: The answer below is only valid for versions of Pandas older than 0.15.0. If you are running Pandas 0.15.0 or newer, use this instead:
data3['bins_spd'] = pd.qcut(data3['spd_pct'], 5, labels=False)
Thanks to @unutbu for pointing it out. :)
Say you have some data that you want to bin (in my case, options spreads), and you want to make a new variable holding the bucket that each observation falls into. The link mentioned above shows that you can get the buckets with:
print(pd.qcut(data3['spd_pct'], 40))
(0.087, 0.146]
(0.0548, 0.087]
(0.146, 0.5]
(0.146, 0.5]
(0.087, 0.146]
(0.0548, 0.087]
(0.5, 2]
which gives you the bin endpoints corresponding to each observation. However, if you would like the bin number for each observation instead, you can do this:
print(pd.qcut(data3['spd_pct'], 5).labels)
[2 1 3 ..., 0 1 4]
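To see how the interval output and the integer output line up, here is a small sketch on invented values (not the original spreads). It uses retbins=True, which asks qcut to also return the bin edges, and labels=False for the bin numbers, since that is the route that works on Pandas 0.15.0 or newer. The names intervals, edges and bin_numbers are only illustrative:

```python
import pandas as pd

# Invented values for illustration only.
spd_pct = pd.Series([0.05, 0.07, 0.09, 0.12, 0.15, 0.20, 0.33, 0.40, 0.60, 1.50])

# Interval view: each observation is mapped to the quantile interval containing it.
# retbins=True additionally returns the array of bin edges.
intervals, edges = pd.qcut(spd_pct, 5, retbins=True)

# Integer view: the same assignment, expressed as 0-based bin numbers.
bin_numbers = pd.qcut(spd_pct, 5, labels=False)

print(intervals.head())
print(edges)                 # six edges delimit five bins
print(bin_numbers.tolist())
```

The integer for each row is just the position of its interval among the sorted bins, so the two views carry the same information.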
Putting it all together, if you would like to create a new variable with just the bin numbers, this should suffice:
data3['bins_spd'] = pd.qcut(data3['spd_pct'], 5).labels
print(data3.head())
   secid      date    symbol  symbol_flag     exdate   last_date  cp_flag
0   5005  1/2/1997  099F2.37            0  1/18/1997         NaN        P
1   5005  1/2/1997  09B0B.1B            0  2/22/1997   12/3/1996        P
2   5005  1/2/1997  09B7C.2F            0  2/22/1997  12/11/1996        P
3   5005  1/2/1997  09EE6.6E            0  1/18/1997  12/27/1996        C
4   5005  1/2/1997  09F2F.CE            0  8/16/1997         NaN        P

   strike_price  best_bid  best_offer  ...  close  volume_y    return
0          7500     2.875      3.2500  ...    4.5     99200  0.074627
1         10000     5.375      5.7500  ...    4.5     99200  0.074627
2          5000     0.625      0.8750  ...    4.5     99200  0.074627
3          5000     0.125      0.1875  ...    4.5     99200  0.074627
4          7500     3.000      3.3750  ...    4.5     99200  0.074627

   cfadj_y  open  cfret  shrout      mid   spd_pct  bins_spd
0        1   4.5      1   57735  3.06250  0.122449         2
1        1   4.5      1   57735  5.56250  0.067416         1
2        1   4.5      1   57735  0.75000  0.333333         3
3        1   4.5      1   57735  0.15625  0.400000         3
4        1   4.5      1   57735  3.18750  0.117647         2

[5 rows x 35 columns]
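On newer pandas, where qcut on a Series returns a categorical Series, a rough equivalent of the column assignment above is to read the codes through the .cat accessor. The sketch below uses an invented spd_pct column rather than the real options data, and the bins_spd_1based column is a hypothetical extra, added only to show bins numbered 1, 2, ... as mentioned in the question:

```python
import pandas as pd

# Invented stand-in for the options data; only the column being binned is included.
data3 = pd.DataFrame(
    {"spd_pct": [0.122449, 0.067416, 0.333333, 0.400000, 0.117647,
                 0.05, 0.9, 0.2, 0.6, 1.5]}
)

# qcut on a Series gives a categorical Series; .cat.codes holds the 0-based bin index.
data3["bins_spd"] = pd.qcut(data3["spd_pct"], 5).cat.codes

# Optional: shift to 1-based bin numbers (1 through 5).
data3["bins_spd_1based"] = data3["bins_spd"] + 1

print(data3.head())
```

Whether to keep the 0-based codes or shift them is a matter of taste; the 0-based form matches what labels=False returns.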
Hope this helps somebody else. At the very least it should be easier to search for now. :)