I have some experience with Apache Spark and Spark SQL. Recently I came across the Apache Drill project. Could you describe the most significant advantages of each and the differences between them? I've already read "Fast Hadoop Analytics (Cloudera Impala vs Spark/Shark vs Apache Drill)", but the topic is still unclear to me.
Here's an article I came across that discusses some of the SQL technologies: http://www.zdnet.com/article/sql-and-hadoop-its-complicated/
Drill is fundamentally different in both the user's experience and the architecture. For example:
Drill 1.0 was just released (May 19, 2015). You can easily download it onto your laptop and play with it without any infrastructure (Hadoop, NoSQL, etc.).
Drill lets you query different kinds of datasets with ANSI SQL. This makes it great for ad hoc data exploration and for connecting BI tools to datasets via ODBC. You can even use Drill to SQL JOIN different kinds of datasets. For example, you could join records in a MySQL table with rows in a JSON file, or a CSV file, or OpenTSDB, or MapR-DB... the list goes on. Drill can connect to lots of different types of data.
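To make the cross-source JOIN a bit more concrete, here is a minimal sketch in Python that builds such a query and the JSON body for Drill's REST query endpoint. It assumes a Drill instance on localhost with the REST API on its default port 8047, a MySQL storage plugin registered under the name `mysql`, and the default `dfs` plugin for files. The table `mysql.shop.customers`, the file `/data/orders.json`, and the column names are all made up for illustration.

```python
import json
import urllib.request

# Assumption: a local Drill instance with its REST API on the default port.
DRILL_URL = "http://localhost:8047/query.json"


def build_join_query(mysql_table, json_path):
    """Return Drill SQL that joins a MySQL table with a JSON file.

    The column names (id, name, customer_id, total) are hypothetical.
    """
    return (
        "SELECT c.name, o.total "
        f"FROM {mysql_table} AS c "
        f"JOIN dfs.`{json_path}` AS o "
        "ON c.id = o.customer_id"
    )


def build_payload(sql):
    """Encode the request body expected by Drill's REST query endpoint."""
    return json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")


def run_query(sql, url=DRILL_URL):
    """POST the query to Drill and return the result rows as a list of dicts."""
    request = urllib.request.Request(
        url,
        data=build_payload(sql),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["rows"]


# Hypothetical table and file names; nothing is sent until run_query is called.
sql = build_join_query("mysql.shop.customers", "/data/orders.json")
```

With Drill running and both plugins configured, `run_query(sql)` should return the joined rows, which you could then hand to whatever tool does the actual processing.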
When I reach for Spark, it's typically because I want to use RDDs (resilient distributed datasets). RDDs make it easy to process a lot of data quickly. Spark also has a bunch of libraries for ML and streaming. Drill, by contrast, isn't a data processing framework in that sense. It just gives you access to the data. You could use Drill to pull data into Spark, or TensorFlow, or PySpark, or Tableau, etc.