How to give dplyr a SQL query and have it return a

2019-08-01 19:47发布

问题:

Say I have a remote tbl open using dbplyr, and I want to use a SQL query on it (maybe because there's not dbplyr translation for what I want to do), how do I give it such that it returns a remote tbl object?

The DBI::dbGetQuery() function allows you to give a query to db, but it returns a data frame on memory, and not an remote tbl object.

For example, say you already have a connection con open to a db, you can create a table like this:

library(tidyverse)

x_df <- expand.grid(A = c('a','b','c'), B = c('d','e','f', 'g','h')) %>% 
  mutate(C = round(rnorm(15), 2))

DBI::dbWriteTable(conn = con,
                  name = "x_tbl",
                  value = x_df,
                  overwrite = TRUE)

x_tbl = tbl(con, 'x_tbl')

sql_query <- build_sql('SELECT a, b, c, avg(c) OVER (PARTITION BY a) AS mean_c FROM x_tbl')

y_df <- DBI::dbGetQuery(con, sql_query) # This returns a data frame on memory

y_tbl <- x_tbl %>% 
  group_by(a) %>% 
  mutate(mean_c = mean(c))

show_query(y_tbl) # This is a remote tbl object

In this case, I could just use y_tbl. But there are cases in which the function has not been translated on dbplyr (for example, quantile doesn't work), and I need to use SQL code. But I don't want to collect the result, I want it to create a remote tbl object. Is there a way I can give a SQL query (like with dbGetQuery()) but have it return a remote tbl?

Thank you

回答1:

Well, playing with how it works, I think I found a way. You can give a sql query inside the mutate function:

y_tbl <- x_tbl %>% 
  group_by(a) %>% 
  mutate(mean_c = sql("avg(c) OVER (PARTITION BY a)"))

show_query(y_tbl) # This is a remote tbl object

This will let you give a SQL definition of a variable without having to compute the table too.



回答2:

As I understand it, there are a collection of standard translations that dbplyr makes from dplyr to SQL. Anything that falls outside this translation is left as is.

For example, DATEFROMPARTS is an SQL function but not an R function. I commonly use the following mutate:

y_tbl <- x_tbl %>% 
    mutate(new_date = DATEFROMPARTS(year_col, month_col, day_col)

And because there is no defined translation from an R function DATEFROMPARTS to an SQL function (because the R function does not exist in dplyr) it is left as is.