Is there a way to send data to train a model in Vowpal Wabbit without writing it to disk?
Here's what I'm trying to do. I have a relatively large dataset in CSV (around 2 GB) which fits in memory with no problem. I load it into a data frame in R, and I have a function to convert the data in that data frame into VW format.
Now, in order to train a model, I have to write the converted data to a file first, and then feed that file to VW. The writing-to-disk part takes far too long, especially since I want to try various models with different feature transformations, which means writing the data to disk multiple times.
So, assuming I'm able to create a character vector in R, in which each element is a row of data in VW format, how could I feed that into VW without writing it to disk?
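(To make the format concrete: outside R, the kind of conversion I mean amounts to something like the following shell sketch, where the column layout and the namespace name f are made up.)

```shell
# Hypothetical sketch: turn CSV rows of the form "label,x1,x2" into
# VW's "label |namespace feature:value ..." input format.
printf '1,0.5,3\n-1,0.2,7\n' |
  awk -F, '{ printf "%s |f x1:%s x2:%s\n", $1, $2, $3 }'
```

This prints `1 |f x1:0.5 x2:3` and `-1 |f x1:0.2 x2:7`, i.e. exactly the kind of rows my character vector contains.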
I considered using the daemon mode and writing the character vector to a localhost connection, but I couldn't get VW to train in daemon mode -- I'm not sure this is even possible.
I'm willing to use C++ (through the Rcpp package) if necessary to make this work.
Thank you very much in advance.
UPDATE:
Thank you everyone for your help. In case anyone's interested, I just piped the output to VW as suggested in the answer, like so:
# Two sample rows of data
datarows <- c("1 |name 1:1 2:4 4:1", "-1 |name 1:1 4:1")
# Open connection to VW
con <- pipe("vw -f my_model.vw")
# Write to connection and close
writeLines(datarows, con)
close(con)
Vowpal Wabbit supports reading data from standard input (cat train.dat | vw), so you can open a pipe directly from R.
Daemon mode supports training. If you need incremental/continuous learning, you can save the model with a trick: send a dummy example whose tag starts with the string "save". Optionally you can specify the model filename as well:
1 save_filename|
Yet another option is to use VW as a library; see an example.
Note that VW supports various feature engineering using feature namespaces.
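For illustration, an input line using two namespaces might look like this (the namespace and feature names here are made up):

1 |user age:25 premium:1 |item price:9.99

Interactions between namespaces can then be requested on the command line, e.g. -q ui to cross the user and item namespaces.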
What you may be looking for is running vw in daemon mode.
The standard way to do this is to run vw as a daemon:
vw -i some.model --daemon --quiet --port 26542 -p /dev/stdout
You may replace 26542 with the port of your choice.
Now you can connect to the server over TCP (which can be on localhost, port 26542), and every request you write to the TCP socket will be responded to on the same socket.
You can both learn (by sending labeled examples, which update the model in real time) and send queries, reading back the responses.
You can do this one query+prediction at a time or many at a time; all you need is a newline character at the end of each request, exactly as if you were testing from a file. Order is guaranteed to be preserved.
You can also intermix requests to learn from with requests that are intended only for prediction and are not supposed to update the in memory model. The trick to achieve this is to use a zero-weight for examples you don't want to be learned from.
This example will update the model because it has a weight of 1:
label 1 'tag1| input_features...
And this one won't update the model because it has a weight of 0:
label 0 'tag2| input_features...
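If your data already marks which rows are for training and which are only queries, the weight field can be filled in mechanically. A small shell sketch (the flag/label/feature columns here are hypothetical):

```shell
# Emit weight 1 for rows flagged "train" and weight 0 for rows flagged
# "query", so the daemon only learns from the training rows.
# Made-up input columns: flag,label,features
printf 'train,1,a\nquery,-1,b\n' |
  awk -F, '{ w = ($1 == "train") ? 1 : 0
             printf "%s %d \047ex%d| %s\n", $2, w, NR, $3 }'
```

This produces `1 1 'ex1| a` and `-1 0 'ex2| b`, matching the weighted-example format above.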
There is a bit more in the official reference on the vowpal wabbit wiki: "How to run vowpal wabbit as a daemon". Note, though, that in that page's main example a model is pre-learned and loaded into memory.
I am also using R to transform data and feed it to Vowpal Wabbit. There is an RVowpalWabbit package on CRAN which can be used to connect R with Vowpal Wabbit; however, it is only available on Linux.
Also, to speed things up, I use the fread function from the data.table package. Transformations on a data.table are also quicker than on a data.frame, but one needs to learn a different syntax.