Interpreting Vowpal Wabbit results: Why are some lines followed by "h"?

Posted 2019-06-21 20:34

Question:

Below is part of the log from training my VW model.

Why are some of these lines followed by h? You'll notice that's true of the "average loss" line in the summary at the end. I'm not sure what this means, or if I should care.

...  
average    since         example     example  current  current  current    
loss       last          counter      weight    label  predict features    
1.000000   1.000000            1         1.0  -1.0000   0.0000       15    
0.500000   0.000000            2         2.0   1.0000   1.0000       15    
1.250000   2.000000            4         4.0  -1.0000   1.0000        9    
1.167489   1.084979            8         8.0  -1.0000   1.0000       29    
1.291439   1.415389           16        16.0   1.0000   1.0000       45    
1.096302   0.901166           32        32.0  -1.0000  -1.0000       21    
1.299807   1.503312           64        64.0  -1.0000   1.0000        7    
1.413753   1.527699          128       128.0  -1.0000   1.0000       11    
1.459430   1.505107          256       256.0  -1.0000   1.0000       47    
1.322658   1.185886          512       512.0  -1.0000  -1.0000       59    
1.193357   1.064056         1024      1024.0  -1.0000   1.0000       69    
1.145822   1.098288         2048      2048.0  -1.0000  -1.0000        5    
1.187072   1.228322         4096      4096.0  -1.0000  -1.0000        9    
1.093551   1.000031         8192      8192.0  -1.0000  -1.0000       67    
1.041445   0.989338        16384     16384.0  -1.0000  -0.6838       29    
1.107593   1.173741        32768     32768.0   1.0000  -1.0000        5    
1.147313   1.187034        65536     65536.0  -1.0000   1.0000        7    
1.078471   1.009628       131072    131072.0  -1.0000  -1.0000       73    
1.004700   1.004700       262144    262144.0  -1.0000   1.0000       41 h  
0.918594   0.832488       524288    524288.0  -1.0000  -1.0000        7 h  
0.868978   0.819363      1048576   1048576.0  -1.0000  -1.0000       21 h  

finished run  
number of examples per pass = 152064  
passes used = 10  
weighted example sum = 1.52064e+06  
weighted label sum = -854360  
average loss = 0.809741 h  
...

Thanks

Answer 1:

This h is printed when

(!all.holdout_set_off && all.current_pass >= 1)

is true (you can locate the relevant code with grep -nH -e '\<h\\n' vowpalwabbit/*.cc and inspect it).
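
If you want to reproduce that search yourself, something like the following should work (the repository URL is VW's current GitHub home; the source layout may have moved around since this answer was written, hence the recursive grep):

git clone https://github.com/VowpalWabbit/vowpal_wabbit.git
cd vowpal_wabbit
# find the format strings that append "h" to the progress lines
grep -rnH -e '\<h\\n' vowpalwabbit/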

Search for --holdout_off in the Command line arguments documentation:

--holdout_off disables holdout validation for multiple pass learning. By default, VW holds out a (controllable default = 1/10th) subset of examples whenever --passes > 1 and reports the test loss on the print out. This is used to prevent overfitting in multiple pass learning. An extra h is printed at the end of the line to specify the reported losses are holdout validation loss, instead of progressive validation loss.
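
For illustration, here is roughly what a multi-pass run with and without holdout validation looks like (the file names train.vw and model.vw are placeholders; --passes > 1 requires a cache, hence -c):

# Multi-pass training: by default VW holds out every 10th example and,
# from the second pass on, reports holdout loss (the lines ending in "h").
vw -d train.vw -c --passes 10 -f model.vw

# Same run with holdout validation disabled: every example is used for
# training and all reported losses are progressive (training) losses.
vw -d train.vw -c --passes 10 -f model.vw --holdout_off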



Answer 2:

VW trains the model on examples from the input file and, while doing so, prints the running average training loss (the lines without the 'h' suffix). When several passes over the file are needed (specified with --passes n), it sets aside every k-th example (adjustable with --holdout_period k) as a holdout set and does not use those examples for training. On the second and subsequent passes it reports the loss on these holdout examples instead of the training examples and marks those values with ' h'. If you see very small loss values without 'h' followed by much larger values with 'h', your model is probably overfitting. If you have already verified that your model does not overfit and want to use multiple passes over the whole dataset for training, specify --holdout_off. Otherwise you lose 10% of your data to the holdout set (--holdout_period defaults to 10).
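
As a rough sketch (train.vw and model.vw are placeholder file names), the holdout fraction can be adjusted, or the holdout disabled entirely:

# Hold out every 5th example (20% of the data) instead of the default every 10th.
vw -d train.vw -c --passes 10 --holdout_period 5 -f model.vw

# Train on all of the data once you have ruled out overfitting.
vw -d train.vw -c --passes 10 --holdout_off -f model.vw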