System time for parallel and serial processing

I'm running a Bayesian MCMC probit model, and I'm trying to implement it in parallel. I'm getting confusing results about the performance of my machine when comparing parallel to serial. I don't have a lot of experience doing parallel processing, so it is possible I'm not doing it right.

I'm using MCMCprobit in the MCMCpack package for the probit model, and for parallel processing I'm using parLapply in the parallel package.

Here's my code for the serial run, and the results from system.time:

system.time(serial<-MCMCprobit(formula=econ_model,data=mydata,mcmc=10000,burnin=100))

   user  system elapsed 
 657.36   73.69  737.82

Here's my code for the parallel run:

#Setting up the functions for parLapply:
probit_modeling <- function(...) {
  args <- list(...)
  library(MCMCpack)
  MCMCprobit(formula=args$model, data=args$data, burnin=args$burnin, mcmc=args$mcmc, thin=1)
}

probit_Parallel <- function(mc, model, data,burnin,mcmc) {
  cl <- makeCluster(mc)
  ## To make this reproducible:
  clusterSetRNGStream(cl, 123)
  library(MCMCpack) # needed for c() method on master
  probit.res <- do.call(c, parLapply(cl, seq_len(mc), probit_modeling, model=model, data=data, 
                                        mcmc=mcmc,burnin=burnin))
  stopCluster(cl)
  return(probit.res)
}


system.time(test<-probit_Parallel(model=econ_model,data=mydata,mcmc=10000,burnin=100,mc=2))

And the results from system.time:

   user  system elapsed 
   0.26    0.53 1097.25

Any ideas why user and system times would be so much shorter for the parallel process, but the elapsed time so much longer? I tried it at shorter MCMC runs (100 and 1000), and the story is the same. I'm assuming I'm making a mistake somewhere.

Here are my computer specifications:

R 3.1.3
8 GB memory
Windows 7 64 bit
Intel Core i5 2520M CPU, dual core

It appears to me that each of the two workers is doing as much work as the sequential version: every worker calls MCMCprobit with the full mcmc=10000. To run faster than the sequential code, each worker should perform only a fraction of the total work. In this example that could be done by dividing mcmc by the number of workers, although that may not be what you really want to do.
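Here is a minimal sketch of that idea, not a drop-in replacement. The names split_iterations, run_chain and probit_parallel_split are hypothetical helpers introduced for illustration. The seed = list(NA, i) argument assumes the L'Ecuyer option described in the MCMCpack documentation (a list of a seed and a substream number), which is meant to give each worker a different random stream. Check ?MCMCprobit for your version before relying on it.

```r
library(parallel)

# Hypothetical helper: split the total number of post-burn-in draws
# across workers as evenly as possible.
split_iterations <- function(total, workers) {
  base <- total %/% workers
  extra <- total %% workers
  base + as.integer(seq_len(workers) <= extra)
}

# Hypothetical worker function: runs one chain with its share of the draws.
# Defined at top level so that only the function itself is sent to the workers.
run_chain <- function(i, model, data, burnin, draws) {
  # Assumption: list(NA, i) selects L'Ecuyer substream i (see ?MCMCprobit).
  MCMCpack::MCMCprobit(formula = model, data = data, burnin = burnin,
                       mcmc = draws[i], thin = 1, seed = list(NA, i))
}

# Each worker still pays the full burn-in, but only a share of the draws.
probit_parallel_split <- function(workers, model, data, burnin, mcmc) {
  cl <- makeCluster(workers)
  on.exit(stopCluster(cl))
  draws <- split_iterations(mcmc, workers)
  # Returns a plain list with one mcmc object per worker.
  parLapply(cl, seq_len(workers), run_chain,
            model = model, data = data, burnin = burnin, draws = draws)
}
```

With the objects from the question, the call would look like probit_parallel_split(workers = 2, model = econ_model, data = mydata, burnin = 100, mcmc = 10000), so each worker draws 5000 samples after burn-in. The result is left as a list of chains; how to combine them (for example with coda's mcmc.list when the chains have equal length) depends on what you want to do with the output.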

I think that explains the long elapsed time reported by system.time. The "user" and "system" times are short because they are measured for the master process, which uses very little CPU time while executing parLapply. The real CPU time is consumed by the workers, and system.time does not report it.
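The effect can be reproduced without MCMCpack. The sketch below uses a made-up CPU-bound function, burn_cpu, as a stand-in for one chain. It times parLapply on the master and then asks each worker for its own CPU time with clusterEvalQ before the cluster is stopped.

```r
library(parallel)

# Hypothetical CPU-bound task standing in for one MCMC chain.
burn_cpu <- function(i) {
  x <- 0
  for (k in seq_len(5e6)) x <- x + sqrt(k)
  x
}

cl <- makeCluster(2)

# Timing measured on the master: it mostly waits for the workers.
timing <- system.time(res <- parLapply(cl, 1:2, burn_cpu))

# CPU time accumulated by each worker process itself.
worker_cpu <- clusterEvalQ(cl, proc.time()[["user.self"]])

stopCluster(cl)

print(timing)
print(unlist(worker_cpu))
```

On a typical run the "user" value in timing should be close to zero, while the values in worker_cpu account for most of the elapsed time. The exact numbers depend on the machine.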