What parameters should I use in VW for a binary classification task? For example, let's use rcv1_small.dat. I thought it is better to use the logistic loss function (or hinge), and that it makes no sense to use --oaa 2. However, the empirical results (with progressive validation 0/1 loss reported in all 4 experiments) show that the best combination is --oaa 2 without logistic loss, i.e. with the default squared loss:
cd vowpal_wabbit/test/train-sets
cat rcv1_small.dat | vw --binary
# average loss = 0.0861
cat rcv1_small.dat | vw --binary --loss_function=logistic
# average loss = 0.0909
cat rcv1_small.dat | sed 's/^-1/2/' | vw --oaa 2
# average loss = 0.0857
cat rcv1_small.dat | sed 's/^-1/2/' | vw --oaa 2 --loss_function=logistic
# average loss = 0.0934
My primary question is: Why does --oaa 2 not give exactly the same results as --binary (in the above setting)?
My secondary questions are: Why does optimizing the logistic loss not improve the 0/1 loss, compared to optimizing the default squared loss? Is this specific to this particular dataset?
I have experienced something similar while using --csoaa. The details can be found here. My guess is that in the case of a multiclass problem with N classes (even if you specify 2 as the number of classes), vw effectively works with N copies of the features. The same example gets a different ft_offset value when it is predicted/learned for each possible class, and this offset is used in the hashing algorithm. So each class gets an "independent" set of features from the same dataset row. Of course the feature values are the same, but vw doesn't keep values - only feature weights - and the weights are different for each possible class. Since the amount of RAM used for storing these weights is fixed with -b (-b 18 by default), the more classes you have, the higher the chance of a hash collision. You can try to increase the -b value and check whether the difference between the --oaa 2 and --binary results decreases (see the sketch below). But I might be wrong, as I didn't go too deep into the vw code.
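For instance, a minimal sketch of that check, rerunning two of the experiments above with a larger hash table (24 bits is an arbitrary choice, just larger than the default 18):
cat rcv1_small.dat | vw --binary -b 24
cat rcv1_small.dat | sed 's/^-1/2/' | vw --oaa 2 -b 24
# if hash collisions are the culprit, the gap between the two
# average losses should shrink as -b grows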
As for the loss function: you can't compare average loss values of the squared (default) and logistic loss functions directly. You should take the raw prediction values obtained with squared loss and compute the loss of those predictions in terms of logistic loss. The function is log(1 + exp(-label * prediction)), where label is the a priori known answer. Such functions (float getLoss(float prediction, float label)) for all loss functions implemented in vw can be found in loss_functions.cc.
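For instance, a minimal sketch of that computation for the logistic case (my own illustration of the formula above, not vw's actual source):
#include <cmath>

// logistic loss of a raw (unscaled) prediction against a {-1,1} label:
// log(1 + exp(-label * prediction))
float getLoss(float prediction, float label) {
    return std::log(1.0f + std::exp(-label * prediction));
}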
Or you can first scale the raw prediction value to [0..1] with 1.f / (1.f + exp(-prediction)) and then compute the log loss as described on kaggle.com:
double val = 1.0 / (1.0 + exp(-prediction)); // y = f(x) -> [0, 1]
if (val < 1e-15) val = 1e-15;                // clamp to avoid log(0)
if (val > 1.0 - 1e-15) val = 1.0 - 1e-15;
double xx = (label < 0) ? 0 : 1;             // label {-1,1} -> {0,1}
double loss = xx * log(val) + (1.0 - xx) * log(1.0 - val);
loss *= -1;
You can also scale raw predictions to [0..1] with the '/vowpal_wabbit/utl/logistic' script or the --link=logistic parameter. Both use 1 / (1 + exp(-x)).
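For example, a quick way to look at the scaled predictions directly (assuming a Linux shell; -p writes predictions to the given file, here stdout):
cat rcv1_small.dat | vw --loss_function=logistic --link=logistic -p /dev/stdout
# each printed prediction is 1/(1+exp(-raw)), i.e. in [0, 1]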