parlapply on sqlQuery from RODBC

2019-04-13 06:53发布

问题:

R Version : 2.14.1 x64 Running on Windows 7 Connecting to a database on a remote Microsoft SQL Server 2012

I have an unordered vectors of names, say:

names<-c(“A”, “B”, “A”, “C”,”C”)

each of which have an id in a table in my db. I need to convert the names to their corresponding ids.

I currently have the following code to do it.

###
names<-c(“A”, “B”, “A”, “C”,”C”)
dbConn<-odbcDriverConnect(connection=”connection string”) #successfully connects

nameToID<-function(name, dbConn){
                #dbConn : active db connection formed via odbcDriverConnect
                #name     : a char string

                sqlQuery(dbConn, paste(“select id from table where name=’”, name, “’”, sep=””))
}
sapply(names, nameToID, dbConn=dbConn)
###

Barring better ways to do this, which could involve loading the table into R then working with the problem there (which is possible), I understand why the following doesn’t work, but I cannot seem to find a solution. Attempting to use parallelization via the package ‘parallel’ :

###
names<-c(“A”, “B”, “A”, “C”,”C”)
dbConn<-odbcDriverConnect(connection=”connection string”) #successfully connects

nameToID<-function(name, dbConn){
                #dbConn : active db connection formed via odbcDriverConnect
                #name     : a char string

                sqlQuery(dbConn, paste(“select id from table where name=’”, name, “’”, sep=””))
}

mc<-detectCores()
cl<-makeCluster(mc)
clusterExport(cl, c(“sqlQuery”, “dbConn”))
parSapply(cl, names, nameToID, dbConn=dbConn)    #incorrect passing of nameToID’s second argument
###

As in the comment, this is not the correct way to assign the second argument to nameToID.

I have also tried the following:

parSapply(cl, names, function(x) nameToID(x, dbConn))

in place of the previous parSapply call, but that also does not work, with the error being thrown saying “the first parameter is not an open RODBC connection”, presumably referring to the first parameter of the sqlQuery(). dbConn remains open though

The following code does work with parallization.

###
names<-c(“A”, “B”, “A”, “C”,”C”)
dbConn<-odbcDriverConnect(connection=”connection string”) #successfully connects
nameToID<-function(name){
                #name     : a char string
                dbConn<-odbcDriverConnect(connection=”string”)
                result<-sqlQuery(dbConn, paste(“select id from table where name=’”, name, “’”, sep=””))
                odbcClose(dbConn)
                result
}

mc<-detectCores()
cl<-makeCluster(mc)
clusterExport(cl, c(“sqlQuery”, “odbcDriverConnect”, “odbcClose”, “dbConn”, “nameToID”))      #throwing everything in
parSapply(cl, names, nameToID)
###

But the constant opening and closing of the connection ruins the gains from parallelization, and seems just a bit silly.

So the overall question would be how to pass the second parameter (the open db connection) to the function within parSapply, in much the same way as it is done in the regular apply? In general, how does one pass a second, third, nth parameter to a function within a parallel routine?

Thanks and if you need any more information let me know.

-DT

回答1:

Database connection objects can't be exported or passed as function arguments because they contain socket connections. If you try, it will be serialized, sent to the workers and deserialized, but it won't work correctly since the socket connection won't be valid.

The solution is to create the database connection on each worker before calling parSapply. I often do that using clusterEvalQ:

clusterEvalQ(cl, {
    library(RODBC)
    dbConn <- odbcDriverConnect(connection="connection string")
    NULL
})

Now the worker function can be written as:

nameToID <- function(name) {
    sqlQuery(dbConn, paste("select id from table where name='", name, "'", sep=""))
}

and called with:

parSapply(cl, names, nameToID)  

Also note that since RODBC is loaded on each of the workers you don't have to export functions defined in it, which I think is good programming practice.