My data consists of a mix of continuous and categorical features. Below is a small snippet of how my data looks like in the csv format (Consider it as data collected by a super store chain that operates stores in different cities)
city,avg_income_in_city,population,square_feet_of_store_area, store_type ,avg_revenue
NY ,54504 , 3506908 ,3006 ,INDOOR , 8000091
CH ,44504 , 2505901 ,4098 ,INDOOR , 4000091
HS ,50134 , 3206911 ,1800 ,KIOSK , 7004567
NY ,54504 , 3506908 ,1000 ,KIOSK , 2000091
Her you can see that avg_income_in_city, square_feet_of_store_area and avg_revenue are continuous values where as city,store_type etc are categorical classes (and few more which I have not shown here to maintain the brevity of the data).
I wish to model the data in order to predict the revenue. The question is how to 'Discretizate' the continuous values using sklearn? Does sklearn provide any "readymade" class/method for Discretization of the continuous values? (like we have in Orange e.g Orange.Preprocessor_discretize(data, method=orange.EntropyDiscretization())
Thanks !
You may also consider rendering the Categorical variables numerical, e.g. via indicator variables, a procedure also known as one hot encoding
Try
and fit it to your categorical data, followed by a numerical estimation method such as linear regression. As long as there aren't too many categories (city may be a little too much), this can work well.
As for discretization of continuous variables, you may consider binning using an adapted bin size, or, equivalently, uniform binning after histogram normalization.
numpy.histogram
may be helpful here. Also, while Fayyad-Irani clustering isn't implemented insklearn
, feel free to check outsklearn.cluster
for adaptive discretizations of your data (even if it is only 1D), e.g. via KMeans .Update (Sep 2018): As of version
0.20.0
, there is a function, sklearn.preprocessing.KBinsDiscretizer, which provides discretization of continuous features using a few different strategies:Unfortunately, at the moment, the function does not accept custom intervals (which is a bummer for me as that is what I wanted and the reason I ended up here). If you want to achieve the same, you can use Pandas function cut:
which produces:
Note that, by default,
pd.cut
returns a pd.Series object of dtypeCategory
with elements of typeinterval[int64]
. If you specify your ownlabels
, the dtype of the output will still be aCategory
, but the elements will be of typeint64
. If you want the series to have a numeric dtype, you can use.astype(np.int64)
.My example uses integer data, but it should work just as fine with floats.
you could using pandas.cut method, like this:
The answer is no. There is no binning in scikit-learn. As eickenberg said, you might want to use np.histogram. Features in scikit-learn are assumed to be continuous, not discrete. The main reason why there is no binning is probably that most of sklearn is developed on text, image featuers or dataset from the scientific community. In these settings, binning is rarely helpful. Do you know of a freely available dataset where binning is really beneficial?