I'm evaluating Spark 1.6.0 to build and predict against large (millions of docs, millions of features, thousands of topics) LDA models, something I can accomplish pretty easily with Yahoo! LDA.
Starting small, following the Java examples, I built a 100K doc/600K feature/250 topic/100 iteration model using the Distributed model/EM optimizer. The model built fine and the resulting topics were coherent. I then wrote a wrapper around the new single-document prediction routine (SPARK-10809; which I cherry picked into a custom Spark 1.6.0-based distribution) to get topics for new, unseen documents (skeleton code). The resulting predictions were slow to generate (which I offered a fix for in SPARK-10809) but more worrisome, incoherent (topics/predictions). If a document's predominantly about football, I'd expect the "football" topic (topic 18) to be in the top 10.
Not being able to tell if something's wrong in my prediction code - or if it's because I was using the Distributed/EM-based model (as is hinted at here by jasonl here) - I decided to try the newer Local/Online model. I spent a couple of days tuning my 240 core/768GB RAM 3-node cluster to no avail; seemingly no matter what I try, I run out of memory attempting to build a model this way.
I tried various settings for:
- driver-memory (8G)
- executor-memory (1-225G)
- spark.driver.maxResultSize (including disabling it)
- spark.memory.offheap.enabled (true/false)
- spark.broadcast.blockSize (currently at 8m)
- spark.rdd.compress (currently true)
- changing the serializer (currently Kryo) and its max buffer (512m)
- increasing various timeouts to allow for longer computation (executor.heartbeatInterval, rpc.ask/lookupTimeout, spark.network.timeout) spark.akka.frameSize (1024)
At different settings, it seems to oscillate between a JVM core dump due to off-heap allocation errors (Native memory allocation (mmap) failed to map X bytes for committing reserved memory) and java.lang.OutOfMemoryError: Java heap space. I see references to models being built near my order of magnitude (databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html), so I must be doing something wrong.
Questions:
- Does my prediction routine look OK? Is this an off-by-one error somewhere w.r.t the irrelevant predicted topics?
- Do I stand a chance of building a model with Spark on the order of magnitude described above? Yahoo can do it with modest RAM requirements.
Any pointers as to what I can try next would be much appreciated!