I have started using Vowpal Wabbit for logistic regression; however, I am unable to reproduce the results it gives. Perhaps there is some undocumented "magic" it does, but has anyone been able to replicate / verify / check the calculations for logistic regression?
For example, with the simple data below, we aim to model the way age predicts label. It is obvious there is a strong relationship: as age increases, the probability of observing 1 increases.
As a simple unit test, I used the 12 rows of data below:
age label
20 0
25 0
30 0
35 0
40 0
50 0
60 1
65 0
70 1
75 1
77 1
80 1
Now, performing a logistic regression on this dataset, using R, SPSS or even by hand, produces a model which looks like L = 0.2294*age - 14.08. So if I substitute the age and apply the logit transform prob = 1/(1+EXP(-L)), I obtain predicted probabilities ranging from 0.0001 for the first row to 0.9864 for the last row, as reasonably expected.
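As a point of reference, the batch fit can be reproduced from a shell in one command, assuming R is installed (the data is typed inline, so no file is needed):

Rscript -e '
  age   <- c(20,25,30,35,40,50,60,65,70,75,77,80);
  label <- c(0,0,0,0,0,0,1,0,1,1,1,1);
  m <- glm(label ~ age, family = binomial);          # batch logistic regression
  print(coef(m));                                    # close to -14.08 (intercept) and 0.2294 (age)
  print(round(predict(m, type = "response"), 4));    # fitted probabilities, 0.0001 ... 0.9864
'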
If I plug the same data into Vowpal Wabbit,
-1 'P1 |f age:20
-1 'P2 |f age:25
-1 'P3 |f age:30
-1 'P4 |f age:35
-1 'P5 |f age:40
-1 'P6 |f age:50
1 'P7 |f age:60
-1 'P8 |f age:65
1 'P9 |f age:70
1 'P10 |f age:75
1 'P11 |f age:77
1 'P12 |f age:80
and then perform a logistic regression using
vw -d data.txt -f demo_model.vw --loss_function logistic --invert_hash aaa
(a command line consistent with How to perform logistic regression using vowpal wabbit on very imbalanced dataset), I obtain the model L = -0.00094*age - 0.03857, which is very different.
The predicted values obtained using -r or -p further confirm this. The resulting probabilities end up nearly all the same, for example 0.4857 for age=20 and 0.4716 for age=80, which is extremely off.
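(For reference, the predictions were produced with a command along these lines; the output file name is arbitrary:)

vw -d data.txt -t -i demo_model.vw -p preds.txt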
I have noticed this inconsistency with larger datasets too. In what sense is Vowpal Wabbit carrying out the logistic regression differently, and how are the results to be interpreted?
This is a common misunderstanding of vowpal wabbit.
One cannot compare batch learning with online learning.
vowpal wabbit is not a batch learner. It is an online learner. Online learners learn by looking at examples one at a time and slightly adjusting the weights of the model as they go.
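Schematically, for logistic loss with labels y in {-1, +1}, a plain per-example gradient step with learning rate eta would look like the update below; this is only an illustration of the idea, since vw's actual default update is more sophisticated (adaptive, normalized and invariant):

w \leftarrow w + \eta \cdot \frac{y \, x}{1 + \exp(y \, w^{\top} x)}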
There are advantages and disadvantages to online learning. The downside is that convergence to the final model is slow/gradual. The learner doesn't do a "perfect" job at extracting information from each example, because the process is iterative. Convergence on a final result is deliberately restrained/slow. This can make online learners appear weak on tiny data-sets like the above.
There are several upsides though. For example, online learners don't need to hold the whole data-set in memory, which is what allows users of vw to learn from billions-of-examples data-sets on their desktops and laptops.
Online learners are very sensitive to example order. The worst possible order for an online learner is when the classes are clustered together (all, or almost all, -1s appear first, followed by all the 1s), like the example above does. So the first thing to do to get better results from an online learner like vowpal wabbit is to uniformly shuffle the 1s and -1s (or simply order by time, as the examples typically appear in real life).
OK, now what?
Q: Is there any way to produce a reasonable model in the sense that it gives reasonable predictions on small data when using an online learner?
A: Yes, there is!
You can emulate what a batch learner does more closely by taking two simple steps:
1. Uniformly shuffle the 1 and -1 examples.
2. Run multiple passes over the data, to give the learner a chance to converge (see the recipe below).
Caveat: if you run multiple passes until the error goes to 0, there's a danger of over-fitting. The online learner has perfectly learned your examples, but it may not generalize well to unseen data.
The second issue here is that the predictions vw gives are not logistic-function transformed (this is unfortunate). They are akin to standard deviations from the middle point (truncated at [-50, 50]). You need to pipe the predictions via utl/logistic (in the source tree) to get signed probabilities. Note that these signed probabilities are in the range [-1, +1] rather than [0, 1]. You may use logistic -0 instead of logistic to map them to a [0, 1] range.
So given the above, here's a recipe that should give you more expected results:
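A sketch of such a recipe (file names and the number of passes are illustrative, and utl/logistic is assumed to read the raw predictions on its stdin, per the note above):

# 1) uniformly shuffle the examples (GNU coreutils shuf)
shuf data.txt > data.shuffled.txt

# 2) train with multiple passes; -c builds the cache that --passes needs,
#    and --holdout_off keeps vw from holding out examples of this tiny data-set
vw -d data.shuffled.txt -c --passes 100 --holdout_off \
   --loss_function logistic -f model.vw

# 3) predict and map the raw outputs to signed probabilities in [-1, +1]
#    (use utl/logistic -0 instead for probabilities in [0, 1])
vw -d data.shuffled.txt -t -i model.vw -p /dev/stdout | utl/logistic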
On your data-set this gives the more expected result: predictions close to -1 for the younger ages and close to +1 for the older ages.
You could make the results more/less polarized (closer to 1 on the older ages and closer to -1 on the younger) by increasing/decreasing the number of passes. You may also be interested in other training options, such as the learning rate (-l, which defaults to 0.5). For example, by increasing the learning rate from the default 0.5 to a large number (e.g. 10) you can force vw to converge much faster when training on small data-sets, thus requiring fewer passes to get there.
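A sketch of such an invocation (numbers and file names are illustrative):

# a larger learning rate (-l) converges with fewer passes on tiny data-sets
vw -d data.shuffled.txt -c --passes 20 --holdout_off \
   --loss_function logistic -l 10 -f model.vw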
Update
As of mid-2014, vw no longer requires the external logistic utility to map predictions back to the [0, 1] range. A new --link logistic option maps predictions to the logistic function's [0, 1] range. Similarly, --link glf1 maps predictions to a generalized logistic function's [-1, 1] range.
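So with a recent enough vw the external helper can be skipped entirely; a sketch (file names are illustrative, and it is assumed here that --link logistic is also supplied, or stored in the model, at test time):

# train; with --link logistic the predictions come out directly in [0, 1]
vw -d data.shuffled.txt -c --passes 100 --holdout_off \
   --loss_function logistic --link logistic -f model.vw

# test; -p now writes probabilities in [0, 1]
vw -d data.shuffled.txt -t -i model.vw --link logistic -p probs.txt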