Using Apache Spark to serve real-time web services

Published 2019-08-12 02:44

Question:

We have a use case where we download large volumes of data (on the order of 100 gigabytes per day) from hundreds of data sources, massage and process that data, and then expose it to our customers via a RESTful API. Today the base data set is about 20 TB and it is expected to grow heavily in the future.

For the massaging/processing part, we believe Spark can be a very good choice for us. For exposing the processed/massaged data through an API, one option is to store it in a read-only database like ElephantDB and have the web services talk to ElephantDB (at least this is what Nathan Marz proposes in his Big Data book). I was wondering what the implications would be if we instead made the web services use SparkSQL to access the processed data directly from Spark. What would be the architecture/design dangers in that case?

Everybody is talking about how fast Spark is and about using SparkSQL for interactive queries. But is it already at a stage where it can serve a large volume of web service queries via SparkSQL, with very strict latency SLAs, at hundreds or thousands of requests per second? If Apache Spark could handle this, we could avoid maintaining yet another system such as ElephantDB or Cassandra.

I would like to hear from the experts on this board.

Answer 1:

If the results are stored in files, you have no indexes, and SparkSQL doesn't create indexes either. The only access pattern that can be somewhat fast is reading specific columns from Parquet files and caching tables in memory.
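For illustration, here is a minimal sketch of that pattern in Scala, assuming a hypothetical Parquet path and hypothetical column names, using the SparkSession API from Spark 2.x:

```scala
import org.apache.spark.sql.SparkSession

object ParquetCacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-cache-sketch")
      .master("local[*]") // local mode just for the sketch
      .getOrCreate()

    // Parquet is columnar, so selecting only the needed columns avoids
    // scanning the whole file; there are still no secondary indexes.
    val df = spark.read.parquet("/data/processed") // hypothetical path
      .select("customer_id", "metric")             // hypothetical columns
    df.createOrReplaceTempView("processed")

    // Caching pins the table in executor memory for repeated queries,
    // which is the only way to get query times that feel "interactive".
    spark.sql("CACHE TABLE processed")
    spark.sql("SELECT metric FROM processed WHERE customer_id = 42").show()

    spark.stop()
  }
}
```

Even with the cache warm, every query is still a distributed Spark job with scheduling overhead, not an indexed point lookup.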

But in general it is not a good idea to use SparkSQL to serve web requests, simply because Spark wasn't built for that.

Answer 2:

So you're batch processing the raw data, yes? The ideal approach would be to store the outcome in a key-value format, as you mention with ElephantDB; Project Voldemort has also been shown to be a good fit as read-only storage.

I recommend that you read Nathan Marz's article on combining batch and real-time layers: How to beat the CAP theorem.

It has, however, been questioned by Jay Kreps in his article Questioning the Lambda Architecture. His main concern with the Lambda Architecture is that it is problematic to maintain the "same" logic in two different distributed systems and have them produce the same result.

But since you are using Spark, you can reuse the same logic with Spark Streaming, which was not yet "on the market" when Nathan Marz and Jay Kreps wrote their articles.
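As a minimal sketch of what that reuse looks like (with hypothetical paths, a hypothetical socket source, and a trivially simple aggregation), the same function can drive both the batch layer and the streaming layer:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SharedLogic {
  // The single definition of the business logic: count events per key.
  def aggregate(events: RDD[(String, Long)]): RDD[(String, Long)] =
    events.reduceByKey(_ + _)
}

object BatchAndStreamingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "shared-logic-sketch")

    // Batch layer: apply the logic to historical data (hypothetical path).
    val history = sc.textFile("/data/history")
      .map(line => (line.split(",")(0), 1L))
    SharedLogic.aggregate(history).saveAsTextFile("/data/batch-views")

    // Speed layer: apply the exact same function to every micro-batch.
    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.socketTextStream("localhost", 9999)        // hypothetical source
      .map(line => (line.split(",")(0), 1L))
      .transform(SharedLogic.aggregate _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

This is exactly the maintenance problem Kreps points at: with one engine, the "same" logic really is the same code.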

You can still use SparkSQL to query the raw data interactively, but since Spark was originally built for scheduled batch jobs, this is not its ideal use case. As you have probably noticed, it takes some time to submit a Spark job, and that overhead "kills" the idea of fast queries.

Please look into github.com/spark-jobserver/spark-jobserver. The job server supports sub-second low-latency jobs via long-running job contexts, and it can share Spark RDDs between different jobs, which can prove very efficient for running different interactive logic on the same dataset, e.g. combining machine learning results with ad-hoc (SparkSQL) queries via HTTP requests. Read more about spark-jobserver; there are several talks about it from different Spark Summits available online.
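As a rough sketch of how a query job looks under that model (based on the job server's classic SparkJob trait; the table name, column name, and request parameter below are hypothetical):

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

object CustomerMetricsJob extends SparkJob {
  // Reject the HTTP request early if the expected parameter is missing.
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    if (config.hasPath("customerId")) SparkJobValid
    else SparkJobInvalid("customerId is required")

  // Runs inside a long-lived SparkContext, so there is no per-request
  // JVM startup or job submission cost.
  override def runJob(sc: SparkContext, config: Config): Any = {
    val sqlContext = SQLContext.getOrCreate(sc)
    val id = config.getString("customerId") // sanitize real input, of course
    sqlContext
      .sql(s"SELECT metric FROM processed WHERE customer_id = $id")
      .collect()
  }
}
```

A web service would then trigger this through the job server's REST API (its README documents a POST to /jobs with sync=true for synchronous, low-latency results) instead of going through spark-submit.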