Performance and caching of generated code in Scala

Posted 2019-07-07 18:49

Question:

I need to generate an implementation of a trait at runtime, then execute a known method on an instance of that trait. In this example I call the method a on an instance of A:

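import reflect.runtime._, universe._, tools.reflect.ToolBox

package p {
  trait A { def a: String }

  // Entry point for the demo; vals must live inside an object, not directly in a package.
  object Main extends App {
    val tb = currentMirror.mkToolBox()
    // Compile the class definition at runtime, then evaluate `new C`.
    val d: A = tb.eval(q"""class C extends p.A { def a = "foo" }; new C""").asInstanceOf[A]
    println(d.a) // "foo"
  }
}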

A few questions around this:

  1. The use case is very performance-sensitive (running generated queries in a data management system). Does tb.eval generate the same bytecode that scalac would have produced had the code been compiled ahead of time rather than at runtime?
  2. I'd like to cache the generated class so I don't have to recompile it for known queries that have already been compiled. Can I get to the bytes of the generated class and store it in a class loader?
  3. Is there a more elegant way to do this, possibly avoiding asInstanceOf?
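
For questions 2 and 3, one possible approach is to keep the compiled thunk in memory, keyed by the generated source, and to hide the cast behind a ClassTag-checked helper. The following is a minimal sketch rather than a confirmed answer: QueryCache, factoryFor, and instantiate are hypothetical names, and the example implements java.util.concurrent.Callable instead of p.A only to stay self-contained. It relies on ToolBox.parse and ToolBox.compile, needs scala-compiler on the classpath, caches only within a single JVM, and does not expose the bytes of the generated class.

```scala
import scala.collection.mutable
import scala.reflect.ClassTag
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

// Hypothetical helper: memoizes compiled code by its source text.
object QueryCache {
  private val tb = currentMirror.mkToolBox()
  private val cache = mutable.Map.empty[String, () => Any]

  // Compile each distinct source string at most once.
  // The whole method is synchronized because the toolbox is not assumed to be thread-safe.
  def factoryFor(source: String): () => Any = synchronized {
    cache.getOrElseUpdate(source, tb.compile(tb.parse(source)))
  }

  // Re-run the cached thunk and check the result type through the ClassTag,
  // so callers do not need asInstanceOf at the call site.
  def instantiate[T](source: String)(implicit ct: ClassTag[T]): T =
    factoryFor(source)() match {
      case t: T  => t
      case other =>
        throw new ClassCastException(s"expected ${ct.runtimeClass.getName}, got $other")
    }
}
```

A call such as QueryCache.instantiate[java.util.concurrent.Callable[String]](src) pays the compilation cost only the first time a given src is seen; later calls only re-run the cached thunk. The ClassTag check covers the erased class only, so type arguments are still unchecked.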

Update: adding details about my use case:

I'm working on the query system of a distributed columnar data store. We have an existing Scala-based query system which performs well. My goal is to compile incoming SQL queries down to Scala so it can run on the existing system.

I already built an interpretation-based version and it runs about 8x slower. I also began an ASM version, partially implementing SELECT, and its performance was on par with the Scala system (which was expected as they both result in near-identical bytecode).

The most important aspect of performance is the execution of the dynamically generated code, because that cost is incurred on every machine that participates in a query across the cluster (currently 60 machines, a number that keeps growing with the size of the dataset), and the generated code is used to scan billions of records. I'm therefore not too concerned about the cost of using reflection and code generation to produce the bytecode, as long as it's reasonably performant.

The trait I need to implement is the interface for queries. It is actually an abstract class rather than a trait, which makes it easier to work with from Java. Here's a heavily simplified example:

abstract class BaseQuery[R <: Result[R]] {
  def init(parameters: Option[JSONObject]): Unit
  def execute(partitionKey: String, subpartitionKey: String, numSubpartitions: Int, page: ColumnSet, referenceData: Map[String, Any]): Option[R]
}

After generating some bytecode, I need to package it up in a jar and ship it to other nodes so they can run the query on their respective partitions, then merge the results (shipping jars and merging results is already supported by the existing query system).
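
The packaging step can be sketched with the JDK's jar API alone. This is an illustration under assumptions, not the existing system's code: JarPackager, writeJar, and loaderFor are hypothetical names, and the sketch assumes the class bytes are already available as a map from binary class name to bytecode (for example from ASM's ClassWriter.toByteArray). How to obtain those bytes from a toolbox is left open.

```scala
import java.io.{File, FileOutputStream}
import java.net.URLClassLoader
import java.util.jar.{JarEntry, JarOutputStream}

// Hypothetical helper for shipping generated classes between nodes.
object JarPackager {
  // Writes each (binary class name -> bytecode) pair as a .class entry in a new jar.
  def writeJar(target: File, classes: Map[String, Array[Byte]]): Unit = {
    val out = new JarOutputStream(new FileOutputStream(target))
    try {
      for ((name, bytes) <- classes) {
        // "p.C" becomes the entry "p/C.class"
        out.putNextEntry(new JarEntry(name.replace('.', '/') + ".class"))
        out.write(bytes)
        out.closeEntry()
      }
    } finally out.close()
  }

  // On the receiving node: a class loader that resolves classes from the shipped jar,
  // delegating to `parent` first (standard parent-first delegation).
  def loaderFor(jar: File, parent: ClassLoader): URLClassLoader =
    new URLClassLoader(Array(jar.toURI.toURL), parent)
}
```

The parent loader passed to loaderFor should be the one that already knows BaseQuery, so that a generated subclass loaded from the jar links against the same base class as the rest of the system.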

I'm looking at Scala's Quasiquotes support with the goal of making it much easier to express code generation. ASM is very low level, error prone, hard to debug, etc. Open to other options, but quasiquotes looked good. Additionally, I see that the Spark SQL project is using it.
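
To make the appeal of quasiquotes concrete, here is a small sketch of how a fragment of a WHERE clause might be turned into a compiled function. Everything in it is hypothetical and invented for illustration: the Pred AST (Gt, Lt, And), the PredicateCodegen object, and the single Int column named x. It does not reflect the real query system or how Spark SQL does it, and it needs scala-compiler on the classpath for the toolbox.

```scala
import scala.reflect.runtime.currentMirror
import scala.reflect.runtime.universe._
import scala.tools.reflect.ToolBox

// Hypothetical mini AST for a predicate over a single Int column.
sealed trait Pred
case class Gt(value: Int) extends Pred
case class Lt(value: Int) extends Pred
case class And(left: Pred, right: Pred) extends Pred

object PredicateCodegen {
  private val tb = currentMirror.mkToolBox()

  // Translate the AST into a Scala tree; `x` is the column value bound in compile below.
  def toTree(p: Pred): Tree = p match {
    case Gt(v)     => q"x > $v"   // Int literals are lifted into the tree
    case Lt(v)     => q"x < $v"
    case And(l, r) => q"${toTree(l)} && ${toTree(r)}"
  }

  // Wrap the predicate body in a function literal and compile it at runtime.
  // The cast is still needed because tb.eval returns Any.
  def compile(p: Pred): Int => Boolean =
    tb.eval(q"(x: Int) => ${toTree(p)}").asInstanceOf[Int => Boolean]
}
```

For example, PredicateCodegen.compile(And(Gt(10), Lt(20))) yields a function equivalent to (x: Int) => x > 10 && x < 20. Composing trees this way replaces the manual stack and label bookkeeping that the ASM version requires.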