Positives/negatives proportion in train set

I'm trying to get the Rocchio algorithm for relevance feedback to work. I have a query and a set of documents marked as positive or negative. For example, I have 60 positives and 337 negatives. I want to train my model (in this case, adjust the query) on part of this dataset and test it on the rest. But with a dataset this imbalanced, I'm not sure how many negatives and how many positives to put in the training set.
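For reference, the textbook Rocchio update moves the query toward the centroid of the positive documents and away from the centroid of the negative ones. The sketch below assumes documents and the query are plain term-weight vectors of equal length; the function name and the default weights (alpha=1.0, beta=0.75, gamma=0.15) are illustrative choices, not taken from my actual setup.

```python
def rocchio_update(query, positives, negatives, alpha=1.0, beta=0.75, gamma=0.15):
    """Return alpha*q + beta*centroid(positives) - gamma*centroid(negatives).

    query      -- list of floats (term weights)
    positives  -- list of vectors judged relevant
    negatives  -- list of vectors judged non-relevant
    The weights alpha/beta/gamma are assumed defaults for illustration.
    """
    def centroid(vectors):
        # An empty set contributes nothing to the update.
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    pos_c = centroid(positives)
    neg_c = centroid(negatives)
    updated = [alpha * q + beta * p - gamma * n
               for q, p, n in zip(query, pos_c, neg_c)]
    # Negative term weights are commonly clipped to zero.
    return [max(0.0, w) for w in updated]


new_query = rocchio_update(
    query=[1.0, 0.0],
    positives=[[1.0, 1.0], [1.0, 0.0]],
    negatives=[[0.0, 2.0]],
)
print(new_query)
```

Because each class enters through its centroid (an average), the raw counts of positives and negatives do not scale the update directly; in this formulation the balance between the two classes is set by beta and gamma.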

Another problem is that, depending on the proportion of positives to negatives in the test set, I get misleading precision, recall and F1 scores. For example, a test set with 49 positives and 17 negatives gives me Precision=0.742, Recall=1.000 and F1=0.852, with TP=49, FP=17, TN=0, FN=0.
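A minimal sketch of how these figures follow from the confusion-matrix counts. The function names are illustrative. Specificity (the true-negative rate) is not one of the metrics I reported; it is added here as an assumption about what would expose the problem, since TN=0 means no negative was ever rejected.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts, guarding against 0/0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def specificity(tn, fp):
    """True-negative rate: the share of actual negatives that were rejected."""
    return tn / (tn + fp) if (tn + fp) else 0.0


p, r, f1 = precision_recall_f1(tp=49, fp=17, fn=0)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.742 1.0 0.852
print(specificity(tn=0, fp=17))                # 0.0
```

With every document predicted positive, precision simply equals the share of positives in the test set (49/66), which is why the scores change with the class proportion.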

The distribution of positives and negatives for other queries doesn't give me any hint about which proportion to choose for my model.

So what I'm asking for is some advice on working with imbalanced datasets so that I get meaningful results.

Thanks in advance, sorry for such a noob(-ish?) question :-)

Tags: machine-learning information-retrieval

1 answer

小情绪 Triste *

Reply #2 · 2019-06-08 00:32

First of all, I think that your algorithm will have a hard time generalizing from such a small number of examples (this also depends on the number of features, of course).

Secondly, I don't think that it is a very good idea to work with an imbalanced dataset. It seems that your algorithm hasn't learned anything, since its output is always "positive". This means that if your dataset were balanced, you would get 50% accuracy. Not too good... If you cannot find a larger dataset, I would suggest that you split yours as follows:

Training set (45 positives / 45 negatives)
Test set (15 positives / 15 negatives)
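A minimal sketch of one way to build such a split, assuming the documents are held in two lists and the surplus negatives are simply dropped at random (random undersampling). The function name, the label encoding (1 = positive, 0 = negative) and the fixed seed are illustrative assumptions.

```python
import random


def balanced_split(positives, negatives, n_train=45, n_test=15, seed=0):
    """Return (train, test) lists of (document, label) pairs with equal class counts.

    Draws n_train + n_test items per class without replacement, so the
    training and test sets never share a document.
    """
    needed = n_train + n_test
    if len(positives) < needed or len(negatives) < needed:
        raise ValueError("not enough documents in one of the classes")

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    pos = rng.sample(positives, needed)
    neg = rng.sample(negatives, needed)  # undersample the majority class

    train = [(d, 1) for d in pos[:n_train]] + [(d, 0) for d in neg[:n_train]]
    test = [(d, 1) for d in pos[n_train:]] + [(d, 0) for d in neg[n_train:]]
    return train, test


# Hypothetical document ids standing in for the 60 positives and 337 negatives.
train, test = balanced_split(list(range(60)), list(range(1000, 1337)))
print(len(train), len(test))  # 90 30
```

Repeating the split with different seeds and averaging the scores would make use of more of the 337 negatives, at the cost of extra runs.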

Anyway, I am still a student, so this is just what I think; it would be good if a more experienced user could confirm or refute it.

Hope it helps!
