How to send a data.frame from R to Q/KDB?

2019-04-29 17:14发布

问题:

I have a large data.frame (15 columns and 100,000 rows) in an existing R session that I want to send to a Q/KDB instance. From KDB's cookbook, the possible solutions are:

RServer for Q: use KDB to create new R instance which shares memory space. This doesn't work because my data is in an existing instance of R.

RServe: run an R server and use TCP/IP to communicate with Q/KDB client. This does not work, because as per RServe's documentation, "every connection has a separate workspace and working directory" and so i presume does not see my existing data.

R Math Library: access R's functionality via a math library without needing an instance of R. This does not work because my data is already in an instance of R.

So any other ideas on how to send data from R to Q/KDB?

回答1:

open a port in Q. I start Q with a batch file:

@echo off
c:\q\w32\q -p 5001

load qserver.dll

tryCatch({
dyn.load("c:/q/qserver.dll")}
  ,error = function(f){
    print("can't load qserver.dll")
  })

Then use these

open_connection <- function(host="localhost", port=5001, user=NULL) {
         parameters <- list(host, as.integer(port), user)
      h <- .Call("kx_r_open_connection", parameters)
    assign(".k.h", h, envir = .GlobalEnv)
    return(h)
}

close_connection <- function(connection) {
         .Call("kx_r_close_connection", as.integer(connection))
}

execute <- function(connection, query) {
         .Call("kx_r_execute", as.integer(connection), query)
}

 d<<-open_connection(host="localhost",port=thePort)

ex2 <- function(...) 
{
  query <- list(...)
  theResult <- NULL
  for(i in query) theResult <- paste0(theResult,i)
  return(execute(d,paste0(theResult)))
}

then ex2 can take multiple arguments so you can build queries with R variables and strings

Edit: thats for R from Q, heres R to Q

2nd Edit: improved algo:

library(stringr)
  RToQTable <- function(Rtable,Qname,withColNames=TRUE,withRowNames=TRUE,colSuffix = NULL)
{
  theColnames <- if(!withColNames || length(colnames(Rtable))==0) paste0("col",as.character(1:length(Rtable[1,])),colSuffix) else colnames(Rtable)
  if(!withRowNames || length(rownames(Rtable))==0) withRowNames <- FALSE
  Rtable <- rbind(Rtable,"linesep")
  charnum <- as.integer(nchar(thestr <- paste(paste0(theColnames,':("',str_split(paste(Rtable,collapse='";"'),';\"linesep\";\"')[[1]],');'),collapse="")) - 11)
  if(withRowNames)
    ex2(Qname,":([]",Qname,str_replace_all(paste0("`",paste(rownames(Rtable),collapse="`"))," ","_"),";",.Internal(substr(thestr,1L,charnum)),"))") else
    ex2(Qname,":([]",.Internal(substr(thestr,1L,charnum)),"))")
}

> bigMat <- matrix(runif(1500000),nrow=100000,ncol=15)
> microbenchmark(RToQTable(bigMat,"Qmat"),times=3)
Unit: seconds
                      expr      min     lq     mean   median       uq      max neval
 RToQTable(bigMat, "Qmat") 10.29171 10.315 10.32766 10.33829 10.34563 10.35298     3

This will work for a matrix, so for a data frame just save a vector containing the types of each column, then convert the dataframe to a matrix, import the matrix to Q, and cast the types

Note that this algo is approx O(rows * cols^1.1) so you'll need to chop the columns up into multiple matricies if you have any more than 20 to get O(rows * cols)

but for your example 150,000 rows and 15 columns takes 10 seconds so further optimization may not be necessary.