Apache Drill vs Spark

I have some experience with Apache Spark and Spark SQL. I recently came across the Apache Drill project. Could you describe the most significant advantages of each and the differences between them? I've already read Fast Hadoop Analytics (Cloudera Impala vs Spark/Shark vs Apache Drill), but the topic is still unclear to me.

Tags: hadoop apache-spark bigdata apache-drill

2 Answers

Rolldiameter

Reply #2 · 2019-03-15 06:49

Here's an article I came across that discusses some of the SQL technologies: http://www.zdnet.com/article/sql-and-hadoop-its-complicated/

Drill is fundamentally different in both the user's experience and the architecture. For example:

Drill is a schema-free query engine. For instance, you can point it at a directory of JSON or Parquet log files (on your local box, an NFS share, S3, HDFS, MapR-FS, etc.) and run a query. You don't have to load data, create and manage schemas or pre-process the data.
Drill uses a JSON document model internally, which allows it to query data of any structure. A lot of modern data is complex, meaning a record can contain nested structures and arrays, and field names may actually encode values such as timestamps or web page URLs. Drill allows normal BI tools to operate seamlessly on such data without requiring the data to be flattened in advance.
Drill works with a variety of non-relational datastores, including Hadoop, NoSQL databases (MongoDB, HBase) and cloud storage. Additional datastores will be added.
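
To make the "point it at a directory and run a query" idea concrete, here is a minimal sketch that sends SQL to Drill over its REST interface. It assumes a Drill instance running locally with the web port at its default of 8047; the `/tmp/logs` directory and the helper names are hypothetical, and nothing here declares a schema for the files.

```python
import json
import urllib.request

# Assumption: Drill is running locally and its REST port is the default 8047.
DRILL_URL = "http://localhost:8047/query.json"

# Hypothetical directory of JSON log files, queried in place through the dfs plugin.
LOG_QUERY = "SELECT * FROM dfs.`/tmp/logs` LIMIT 10"


def build_drill_request(sql, url=DRILL_URL):
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_drill_query(sql, url=DRILL_URL):
    """Send the query to Drill and return the result rows as a list of dicts."""
    with urllib.request.urlopen(build_drill_request(sql, url)) as response:
        return json.load(response)["rows"]
```

With Drill running, `run_drill_query(LOG_QUERY)` should return the first ten records as dictionaries, with whatever fields the files happen to contain.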

Drill 1.0 was just released (May 19, 2015). You can easily download it onto your laptop and play with it without any infrastructure (Hadoop, NoSQL, etc.).


Viruses.

Reply #3 · 2019-03-15 06:54

Drill lets you query different kinds of datasets with ANSI SQL. This makes it great for ad hoc data exploration and for connecting BI tools to datasets via ODBC. You can even use Drill to SQL JOIN different kinds of datasets. For example, you could join records in a MySQL table with rows in a JSON file, or a CSV file, or OpenTSDB, or MapR-DB... the list goes on. Drill can connect to lots of different types of data.
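
As a rough illustration of such a cross-source join, the sketch below only builds the SQL text. Every name in it is an assumption: a storage plugin registered as `mysql` that exposes a `shop.customers` table, an `/data/orders.json` file reachable through the `dfs` plugin, and the column names. Depending on the data, the join keys may also need an explicit CAST.

```python
def build_join_query(rdbms_table, json_path, left_key, right_key):
    """Return a Drill SQL string joining an RDBMS table with a JSON file."""
    return (
        "SELECT c.name, o.amount "
        f"FROM {rdbms_table} AS c "
        f"JOIN dfs.`{json_path}` AS o "
        f"ON c.{left_key} = o.{right_key}"
    )


# Hypothetical plugin, table, file, and column names, for illustration only.
JOIN_QUERY = build_join_query(
    "mysql.shop.customers", "/data/orders.json", "id", "customer_id"
)
```

The resulting string can be submitted to Drill like any other query, for example from a BI tool over ODBC or from the Drill shell.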

When I reach for Spark, it's typically because I want RDDs (resilient distributed datasets). RDDs make it easy to process a lot of data quickly. Spark also has a bunch of libraries for ML and streaming. Drill, by contrast, is not a general-purpose data processing framework. It is a query engine that gets you SQL access to the data. You could use Drill to pull data into Spark, or TensorFlow, or PySpark, or Tableau, etc.
