How to label special cases in RandomForestRegressor

Published 2019-08-26 04:49

Question:

I have a set of numerical features (f1, f2, f3, f4, f5) as follows for each user in my dataset.

        f1   f2   f3   f4   f5
user1  0.1  1.1    0  1.7    1
user2  1.1  0.3    1  1.3    3
user3  0.8  0.3    0  1.1    2
user4  1.5  1.2    1  0.8    3
user5  1.6  1.3    3  0.3    0

My target output is a prioritised user list, as shown in the example below.

        f1   f2   f3   f4   f5  target_priority
user1  0.1  1.1    0  1.7    1        2
user2  1.1  0.3    1  1.3    3        1
user3  0.8  0.3    0  1.1    2        5
user4  1.5  1.2    1  0.8    3        3
user5  1.6  1.3    3  0.3    0        4

I want to use these features in a way that reflects the priority of the user. Currently, I am using sklearn's RandomForestRegressor to perform this task.
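For context, here is roughly how I am fitting it at the moment (a minimal sketch; a dataframe df with the columns f1-f5 and target_priority is assumed to be loaded already):

from sklearn.ensemble import RandomForestRegressor

# df is assumed to already hold the features f1-f5 and the target_priority column
X = df[['f1', 'f2', 'f3', 'f4', 'f5']]
y = df['target_priority']

reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X, y)
predictions = reg.predict(X)  # predicted priority for each user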

However, I got my real dataset recently and it has some users with no priority label. That is because such users are not important to our company (more like general users).

Example (how the real dataset looks):

        f1   f2   f3   f4   f5  target_priority
user1  0.1  1.1    0  1.7    1        2
user2  1.1  0.3    1  1.3    3        2
user3  0.8  0.3    0  1.1    2       N/A
user4  1.5  1.2    1  0.8    3       N/A
user5  1.6  1.3    3  0.3    0        1

For such special cases (users that do not have a priority label), is it good to give them a special symbol, or a priority level that is much, much lower than the existing priorities (e.g., a priority of 100000000000000000)? How are such special cases handled in RandomForestRegressor?

I am happy to provide more details if needed.

Answer 1:

Okay, if 80-90% of users don't need a priority, you should first build a classifier that decides whether a priority needs to be assigned at all. Since this would be a skewed class, I would recommend using a decision tree or anomaly detection as the classifier; the data points that require a priority would be the anomalies. You can use sklearn for both.

After deciding which objects have to be assigned a priority, look into the distribution of the training data with respect to the priorities. You said that priorities range from 1-100, so if you have at least 5,000 data points and each priority level has at least 35 examples, I would suggest a multiclass classifier (SVC with an rbf kernel is preferred) together with a confusion matrix for checking its accuracy. If that doesn't work, you will have to use a regressor on the data and then round the answer.
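For example, a minimal sketch of that multiclass route (X and y here are placeholders for your feature matrix and the 1-100 priority labels):

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# X: feature matrix, y: priority labels (1-100); both are placeholders
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

svc = SVC(kernel='rbf', gamma='scale')  # rbf kernel as suggested above
svc.fit(X_train, y_train)

# The confusion matrix shows which priority levels get confused with each other
print(confusion_matrix(y_test, svc.predict(X_test)))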

What I basically mean is: if the data is big enough and there is an even distribution among the target labels, go for multiclass classification; if the data is not big enough, go for a regressor. If you want code for any part of it, let me know.

Edit for Code

Okay, let's take it from the top. First, the N.A. values in your target are either stored as np.nan, as a symbol like ?, or as literal text like N.A.; in the latter cases your target label will be of type object. To check, use df[['Target']].dtypes. If it says int or float, you can skip the first step, but if it says object, then we need to fix that first.

import numpy as np
import pandas as pd

df.loc[df['Target'] == 'N.A.', 'Target'] = np.nan  # 'N.A.' can be any placeholder your dataset uses for missing values
df[['Target']] = df[['Target']].astype(float)

Now let's move on to part two, where you need to build the target for your classifier. To do that, use:

df2 = pd.DataFrame()
df2['Bool'] = df['Target'].notna()  # use notna(); comparing with != np.nan always returns True
df1 = pd.concat([df, df2], axis=1)
df1.head()  # Sanity check

This will update your dataframe by adding True whenever a priority was assigned; this column will be your target for the classifier. Notice we are using df1 and not df. Now drop Target from df1, as it is not needed for the first part: df1.drop(['Target'], axis=1, inplace=True)

Now I am going to use random forest classification for this, since anomaly detection should be avoided unless the classes are skewed up to around 98%, but it is worth looking into as well.
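(If you do want to try the anomaly-detection route, a rough sketch with sklearn's IsolationForest could look like the following; the contamination value is an assumption you would tune from your actual class ratio.)

from sklearn.ensemble import IsolationForest

# Treat the minority of users that need a priority as "anomalies".
iso = IsolationForest(n_estimators=100, contamination=0.15, random_state=42)  # contamination: assumed fraction of priority users
iso.fit(df1.drop(['Bool'], axis=1))

labels = iso.predict(df1.drop(['Bool'], axis=1))  # -1 = anomaly (needs a priority), 1 = normal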

Moving on, to build the random forest classifier:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2)  # max_depth is a hyperparameter and you will need to tune it
clf.fit(df1.drop(['Bool'], axis=1), df1['Bool'])

To drop the rows where the output is False:

df1 = df1[df1['Bool'] == True]

Then just use clf.predict() on the new data, drop the rows where the output comes back False, and run a regressor on the remaining data. I am assuming you can do the regressor part, as that is now completely straightforward. Let me know if you face any further issues.
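For completeness, a rough sketch of that last regressor step (it assumes the original df, which still has the Target column; new_rows is a placeholder for the data your classifier flagged as needing a priority):

from sklearn.ensemble import RandomForestRegressor

# Train only on the rows that actually have a priority label
labelled = df[df['Target'].notna()]
reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(labelled.drop(['Target'], axis=1), labelled['Target'])

predicted = reg.predict(new_rows)  # new_rows: placeholder for classifier-flagged data
predicted = predicted.round().clip(1, 100).astype(int)  # round to valid priority levels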