R -> kdb: Pass R data to kdb+ as binary objects

Posted 2019-07-20 00:43

Question:

What's the most efficient way to insert R objects (more specifically, time series expressed as xts or data.table objects, i.e. time-based and numeric columns) into a kdb+ database?

I was only able to find solutions that involve string serialization via q expressions, as described here and here.

Answer 1:

My solution was inspired by this version of qserver.c from GitHub.

Yang added two functions, convert_binary and convert_r, that [de]serialize data, which is basically what you asked for. However, the return value is a hexadecimal array. To use it with the existing execute function, we need paste(collapse="") to join the array into a single string and sprintf to build the query. The following example sends robj in R to the variable d in kdb:

execute(h, sprintf("d:-9!0x%s",paste(convert_r(robj),collapse="")))

The problem is that paste(collapse="") takes quite some time if the array is large.

Here robj is the R object. For example, I tried it with a data.frame (dim = 60,000x100): convert_r() took < 0.5s to convert, paste(collapse="") took 13s to build the single string, and execute(h, ...) took < 1s to transfer the data.
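To see where the time goes on the string route, the following sketch reproduces the collapse step on stand-in data. It assumes the serialized payload can be represented as an R raw vector; the actual return type of convert_r() is not shown in this thread, so the random bytes below are only a placeholder. paste0 builds the same query text that the answer builds with sprintf.

```r
# Stand-in for the serialized bytes. This is NOT real convert_r() output,
# just random data of a known size.
set.seed(1)
payload <- as.raw(sample(0:255, 1e5, replace = TRUE))

# as.character() on a raw vector yields one two-digit hex string per byte.
hex_pieces <- as.character(payload)

# The collapse step that the answer identifies as the bottleneck.
elapsed <- system.time(hex <- paste(hex_pieces, collapse = ""))[["elapsed"]]

# Query text in the same shape as the answer's execute() call.
query <- paste0("d:-9!0x", hex)

cat("payload bytes:", length(payload), "\n")
cat("query chars:  ", nchar(query), "\n")
cat("paste seconds:", elapsed, "\n")
```

Whatever the timing on a given machine, the query text is always about twice the size of the payload, because each byte becomes two hex characters. That overhead, plus building the string, is what a binary transfer avoids.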

I have not found anyone who has written a function sending R Data to kdb via serialized binary data (I don't know why), so I made one myself. Here is the code:

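SEXP kx_r_send_data(SEXP connection, SEXP robj, SEXP varname)
{
  K result, conversion, serialized;
  kx_connection = INTEGER_VALUE(connection);
  /* Convert the R object to a K object, then serialize it to a byte vector. */
  conversion = from_any_robject(robj);
  serialized = b9(2, conversion);
  /* k() consumes one reference to each argument, so r1() keeps ours alive.
     On the kdb side, -9! deserializes the bytes and the result is assigned to varname. */
  result = k(kx_connection, "{[d;v] v set -9!d;}", r1(serialized), ks((S)CHARACTER_VALUE(varname)), (K)0);
  SEXP s = from_any_kobject(result);
  /* Release every K object this function still owns. */
  r0(result);
  r0(conversion);
  r0(serialized);
  return s;
}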

I assume you know how to modify qserver.c and recompile qserver.o. Then add a function to qserver.R:

send_data <- function(connection, r_obj, varname) {
  .Call("kx_r_send_data", as.integer(connection), r_obj, varname)
}

This sends R data to kdb as serialized binary at the C level, with no intermediate string.

Notes:

1) The conversion doesn't work with data.table, as it's not a standard R class. Calling the function with a data.table will lead to a segmentation fault.

2) Serialization doesn't know how to convert date/datetime objects. They all become 0N after transfer into kdb.

Unless you want to implement the date/datetime/data.table conversion from R to K, do NOT call the convert_r() or send_data() functions on those types.

On the other hand, there is a quick workaround. For a data.table, simply use as.data.frame to convert it to the data.frame class before calling the functions. For date/datetime classes, use as.character() to convert them to strings before sending to kdb, then cast to "D" or "P" inside kdb directly.

3) Serializing a data.frame includes other information such as row names, class info, etc. You need to manipulate the data inside kdb after the transfer.
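Regarding note 2: for anyone who does want to implement the date/datetime conversion rather than go through strings, the core arithmetic is an epoch shift. R counts Date values in days and POSIXct values in seconds from 1970-01-01, while kdb+ counts dates in days and timestamps in nanoseconds from 2000-01-01. Below is a minimal sketch in R. The helper names are hypothetical and not part of qserver, and the resulting numbers would still need to be tagged as kdb date/timestamp types in the C conversion code.

```r
# Days between the R epoch (1970-01-01) and the kdb+ epoch (2000-01-01).
KDB_EPOCH_OFFSET_DAYS <- 10957L

# Hypothetical helper: R Date -> days since 2000-01-01 (kdb+ date payload).
r_date_to_kdb_days <- function(d) {
  as.integer(unclass(d)) - KDB_EPOCH_OFFSET_DAYS
}

# Hypothetical helper: R POSIXct -> nanoseconds since 2000-01-01
# (kdb+ timestamp payload). The result is a double, because nanosecond
# counts exceed R's integer range; a double cannot represent every
# nanosecond exactly at this magnitude.
r_time_to_kdb_nanos <- function(t) {
  (as.numeric(t) - KDB_EPOCH_OFFSET_DAYS * 86400) * 1e9
}

print(r_date_to_kdb_days(as.Date(c("2000-01-01", "2000-01-02"))))
print(r_time_to_kdb_nanos(as.POSIXct("2000-01-01 00:00:01", tz = "UTC")))
```

The string workaround described above sidesteps this arithmetic entirely, at the cost of a parse step inside kdb.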

I would suggest writing an R wrapper function that handles those special cases and then calls send_data() to pass the data to kdb. After that, use execute(h, ...) to reshape the data into a standard format inside kdb.
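A wrapper along those lines might start from the sketch below. prepare_for_kdb is a hypothetical name, not part of qserver. It applies only the two workarounds from the notes (as.data.frame for data.table input, text for date/datetime columns). The timestamp text format and the UTC time zone are assumptions; check them against what kdb's "P" cast accepts in your setup.

```r
# Hypothetical helper (not part of qserver): coerce an object into a shape
# that send_data() from this answer is described as handling.
prepare_for_kdb <- function(x) {
  # data.table inherits from data.frame; drop the extra class first.
  if (inherits(x, "data.frame")) {
    x <- as.data.frame(x, stringsAsFactors = FALSE)
  }
  # Anything that is not tabular is passed through unchanged.
  if (!is.data.frame(x)) return(x)

  for (name in names(x)) {
    col <- x[[name]]
    if (inherits(col, "Date")) {
      # Dates become text, to be cast with "D" on the kdb side.
      x[[name]] <- format(col, format = "%Y-%m-%d")
    } else if (inherits(col, "POSIXct")) {
      # Datetimes become text, to be cast with "P" on the kdb side.
      # Format and time zone are assumptions of this sketch.
      x[[name]] <- format(col, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")
    }
  }
  x
}

df <- data.frame(
  d = as.Date("2019-07-20"),
  t = as.POSIXct("2019-07-20 00:43:00", tz = "UTC"),
  v = 1.5
)
out <- prepare_for_kdb(df)
str(out)
```

With a live connection h, the call would then be send_data(h, prepare_for_kdb(dt), "d"), followed by execute(h, ...) to cast the text columns and tidy up the table on the kdb side.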

The same data (60,000x100) now takes < 1s to finish, end-to-end from R to kdb.

PS: There may be a typo in the code, as I don't know how to paste nicely formatted code here and typed it out by hand instead. Let me know if you find any critical typo in it.



Answer 2:

The most "stable" way to interact with kdb from R is to use the string query interface. If you want actual object [de]serialisation, then I suggest you look at the C interface and call that library from R to interact with kdb.



Tags: r ipc kdb