I have a DataFrame in pandas that contains training examples, for example:
   feature1  feature2  class
0  0.548814  0.791725      1
1  0.715189  0.528895      0
2  0.602763  0.568045      0
3  0.544883  0.925597      0
4  0.423655  0.071036      0
5  0.645894  0.087129      0
6  0.437587  0.020218      0
7  0.891773  0.832620      1
8  0.963663  0.778157      0
9  0.383442  0.870012      0
which I generated using:
import pandas as pd
import numpy as np
np.random.seed(0)
number_of_samples = 10
frame = pd.DataFrame({
    'feature1': np.random.random(number_of_samples),
    'feature2': np.random.random(number_of_samples),
    'class': np.random.binomial(2, 0.1, size=number_of_samples),
}, columns=['feature1', 'feature2', 'class'])
print(frame)
As you can see, the training set is imbalanced (8 samples have class 0, while only 2 samples have class 1). I would like to oversample the training set. Specifically, I would like to duplicate training samples with class 1 so that the training set is balanced (i.e., so that the number of samples with class 0 is approximately the same as the number of samples with class 1). How can I do this?
Ideally, I would like a solution that generalizes to a multiclass setting (i.e., the class column may contain integers greater than 1).
You can find the maximum size a group has with frame.groupby('class').size().max() (call it max_size); in your example, this equals 8. For each group, you can then sample with replacement
max_size - len(group)
elements. This way, if you concat these samples to the original DataFrame, all group sizes will be the same and you will keep the original rows. You can play with
max_size - len(group)
and maybe add some noise to it, because taking it exactly will make all group sizes equal.
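A minimal sketch of this approach, using the frame from the question (the names parts and frame_balanced are just illustrative):
import pandas as pd

# Size of the largest class; 8 in the example above.
max_size = frame.groupby('class').size().max()

# Start with the original rows, then append resampled rows per class.
parts = [frame]
for class_label, group in frame.groupby('class'):
    # Sample with replacement exactly enough rows to bring this class up to max_size.
    parts.append(group.sample(max_size - len(group), replace=True))

frame_balanced = pd.concat(parts)
print(frame_balanced['class'].value_counts())
Every class ends up with max_size rows, and this works for any number of classes since it loops over the groups. Pass random_state to group.sample if you want the oversampling to be reproducible.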