I am a newbie to Apache Spark.
My job is to read two CSV files, select some specific columns from them, join them, aggregate the result, and write it to a single CSV file.
For example,
CSV1:
name,age,department_id

CSV2:
department_id,department_name,location

I want to get a third CSV file with:
name,age,department_name
I am loading both CSV files into DataFrames, and I am able to produce the third DataFrame using the DataFrame methods join, select, filter, and drop.
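Roughly, my DataFrame version looks like this (sketched with the Spark 2.x+ SparkSession and built-in CSV reader; on 1.x I use a SQLContext with the spark-csv package instead; file paths are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-join").getOrCreate()

// name, age, department_id
val people = spark.read.option("header", "true").csv("csv1.csv")
// department_id, department_name, location
val departments = spark.read.option("header", "true").csv("csv2.csv")

val result = people
  .join(departments, Seq("department_id"))   // inner join on the shared key
  .select("name", "age", "department_name")

// coalesce(1) so the output lands in a single CSV file (single writer task)
result.coalesce(1).write.option("header", "true").csv("output_df")
```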
I am also able to do the same using plain RDD operations with several RDD.map() calls.
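Roughly like this (column positions follow the schemas above; it reuses the `spark` session from the sketch above, and paths are placeholders):

```scala
val sc = spark.sparkContext

val csv1 = sc.textFile("csv1.csv")
val csv2 = sc.textFile("csv2.csv")

val header1 = csv1.first()
val header2 = csv2.first()

// (department_id, (name, age))
val peopleRdd = csv1.filter(_ != header1)
  .map(_.split(","))
  .map(cols => (cols(2), (cols(0), cols(1))))

// (department_id, department_name)
val departmentsRdd = csv2.filter(_ != header2)
  .map(_.split(","))
  .map(cols => (cols(0), cols(1)))

peopleRdd.join(departmentsRdd)   // shuffle join on the key
  .map { case (_, ((name, age), deptName)) => s"$name,$age,$deptName" }
  .coalesce(1)
  .saveAsTextFile("output_rdd")
```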
And I am also able to do the same by executing HiveQL queries through a HiveContext.
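The SQL version, sketched with the 2.x+ createOrReplaceTempView (on 1.x I register the DataFrames with registerTempTable on a HiveContext instead); it reuses the `people` and `departments` DataFrames from the first sketch:

```scala
people.createOrReplaceTempView("people")
departments.createOrReplaceTempView("departments")

val sqlResult = spark.sql(
  """SELECT p.name, p.age, d.department_name
    |FROM people p
    |JOIN departments d
    |  ON p.department_id = d.department_id""".stripMargin)

sqlResult.coalesce(1).write.option("header", "true").csv("output_sql")
```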
I want to know which is the most efficient way if my CSV files are huge, and why.
Both DataFrames and Spark SQL queries are optimized by the Catalyst engine, so I would guess they will produce similar performance (assuming you are using version >= 1.3).
And both should be better than plain RDD operations, because with RDDs Spark has no knowledge about the structure of your data, so it cannot apply any special optimizations.
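You can check this yourself: both the DataFrame version and the SQL version go through Catalyst, and you can inspect the resulting plans with explain(). A rough sketch (modern SparkSession API; on 1.x the equivalent calls live on SQLContext/HiveContext, and paths/column names follow your question):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("catalyst-plans").getOrCreate()

val people = spark.read.option("header", "true").csv("csv1.csv")
val departments = spark.read.option("header", "true").csv("csv2.csv")
people.createOrReplaceTempView("people")
departments.createOrReplaceTempView("departments")

// DataFrame version: prints parsed, analyzed, optimized and physical plans
people.join(departments, Seq("department_id"))
  .select("name", "age", "department_name")
  .explain(true)

// SQL version: ends up with essentially the same optimized plan
spark.sql(
  """SELECT p.name, p.age, d.department_name
    |FROM people p JOIN departments d ON p.department_id = d.department_id""".stripMargin
).explain(true)
```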
This blog post contains benchmarks showing that DataFrames are much more efficient than RDDs:
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
Here is a snippet from the post:
At a high level, there are two kinds of optimizations. First, Catalyst applies logical optimizations such as predicate pushdown. The optimizer can push filter predicates down into the data source, enabling the physical execution to skip irrelevant data. In the case of Parquet files, entire blocks can be skipped and comparisons on strings can be turned into cheaper integer comparisons via dictionary encoding. In the case of relational databases, predicates are pushed down into the external databases to reduce the amount of data traffic.
Second, Catalyst compiles operations into physical plans for execution and generates JVM bytecode for those plans that is often more optimized than hand-written code. For example, it can choose intelligently between broadcast joins and shuffle joins to reduce network traffic. It can also perform lower level optimizations such as eliminating expensive object allocations and reducing virtual function calls. As a result, we expect performance improvements for existing Spark programs when they migrate to DataFrames.
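In a case like yours, where the departments file is probably a small lookup table, you can also give the optimizer a hand and request a broadcast join explicitly via org.apache.spark.sql.functions.broadcast (available since Spark 1.5). A small sketch reusing the `people` and `departments` DataFrames from above:

```scala
import org.apache.spark.sql.functions.broadcast

// Broadcasting the small side avoids shuffling the large `people` DataFrame.
val result = people
  .join(broadcast(departments), Seq("department_id"))
  .select("name", "age", "department_name")
```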
Here is the performance benchmark: https://databricks.com/wp-content/uploads/2015/02/Screen-Shot-2015-02-16-at-9.46.39-AM.png
The overall direction for Spark is to go with DataFrames, so that queries are optimized through Catalyst.