I am a newbie to Apache Spark.
My job is to read two CSV files, select some specific columns from them, join them, aggregate the result, and write it to a single CSV file.
For example,
CSV1:
name,age,department_id

CSV2:
department_id,department_name,location

I want to get a third CSV file with:
name,age,department_name
I am loading both CSV files into DataFrames, and I am able to produce the third DataFrame using the DataFrame methods join, select, filter, and drop.
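Roughly, my DataFrame version looks like this (sketched with the Spark 2.x+ SparkSession and built-in CSV reader; on 1.x I use a SQLContext with the spark-csv package instead; file paths are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-join").getOrCreate()

// name, age, department_id
val people = spark.read.option("header", "true").csv("csv1.csv")
// department_id, department_name, location
val departments = spark.read.option("header", "true").csv("csv2.csv")

val result = people
  .join(departments, Seq("department_id"))   // inner join on the shared key
  .select("name", "age", "department_name")

// coalesce(1) so the output lands in a single CSV file (single writer task)
result.coalesce(1).write.option("header", "true").csv("output_df")
```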
I am also able to do the same using plain RDD operations with several RDD.map() calls.
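Roughly like this (column positions follow the schemas above; it reuses the `spark` session from the sketch above, and paths are placeholders):

```scala
val sc = spark.sparkContext

val csv1 = sc.textFile("csv1.csv")
val csv2 = sc.textFile("csv2.csv")

val header1 = csv1.first()
val header2 = csv2.first()

// (department_id, (name, age))
val peopleRdd = csv1.filter(_ != header1)
  .map(_.split(","))
  .map(cols => (cols(2), (cols(0), cols(1))))

// (department_id, department_name)
val departmentsRdd = csv2.filter(_ != header2)
  .map(_.split(","))
  .map(cols => (cols(0), cols(1)))

peopleRdd.join(departmentsRdd)   // shuffle join on the key
  .map { case (_, ((name, age), deptName)) => s"$name,$age,$deptName" }
  .coalesce(1)
  .saveAsTextFile("output_rdd")
```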
And I am also able to do the same by executing HiveQL queries through a HiveContext.
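The SQL version, sketched with the 2.x+ createOrReplaceTempView (on 1.x I register the DataFrames with registerTempTable on a HiveContext instead); it reuses the `people` and `departments` DataFrames from the first sketch:

```scala
people.createOrReplaceTempView("people")
departments.createOrReplaceTempView("departments")

val sqlResult = spark.sql(
  """SELECT p.name, p.age, d.department_name
    |FROM people p
    |JOIN departments d
    |  ON p.department_id = d.department_id""".stripMargin)

sqlResult.coalesce(1).write.option("header", "true").csv("output_sql")
```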
I want to know which is the most efficient way if my CSV files are huge, and why.
Both DataFrames and Spark SQL queries are optimized by the Catalyst engine, so I would guess they will produce similar performance (assuming you are using version >= 1.3).
And both should be better than plain RDD operations, because with RDDs Spark has no knowledge about the structure of your data, so it cannot apply any special optimizations.
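You can check this yourself: both the DataFrame version and the SQL version go through Catalyst, and you can inspect the resulting plans with explain(). A rough sketch (modern SparkSession API; on 1.x the equivalent calls live on SQLContext/HiveContext, and paths/column names follow your question):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("catalyst-plans").getOrCreate()

val people = spark.read.option("header", "true").csv("csv1.csv")
val departments = spark.read.option("header", "true").csv("csv2.csv")
people.createOrReplaceTempView("people")
departments.createOrReplaceTempView("departments")

// DataFrame version: prints parsed, analyzed, optimized and physical plans
people.join(departments, Seq("department_id"))
  .select("name", "age", "department_name")
  .explain(true)

// SQL version: ends up with essentially the same optimized plan
spark.sql(
  """SELECT p.name, p.age, d.department_name
    |FROM people p JOIN departments d ON p.department_id = d.department_id""".stripMargin
).explain(true)
```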
This blog post contains benchmarks showing that DataFrames are much more efficient than RDDs:
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
Here is a snippet from the post:
At a high level, there are two kinds of optimizations. First, Catalyst applies logical optimizations such as predicate pushdown. The optimizer can push filter predicates down into the data source, enabling the physical execution to skip irrelevant data. In the case of Parquet files, entire blocks can be skipped and comparisons on strings can be turned into cheaper integer comparisons via dictionary encoding. In the case of relational databases, predicates are pushed down into the external databases to reduce the amount of data traffic.
Second, Catalyst compiles operations into physical plans for execution and generates JVM bytecode for those plans that is often more optimized than hand-written code. For example, it can choose intelligently between broadcast joins and shuffle joins to reduce network traffic. It can also perform lower level optimizations such as eliminating expensive object allocations and reducing virtual function calls. As a result, we expect performance improvements for existing Spark programs when they migrate to DataFrames.
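In a case like yours, where the departments file is probably a small lookup table, you can also give the optimizer a hand and request a broadcast join explicitly via org.apache.spark.sql.functions.broadcast (available since Spark 1.5). A small sketch reusing the `people` and `departments` DataFrames from above:

```scala
import org.apache.spark.sql.functions.broadcast

// Broadcasting the small side avoids shuffling the large `people` DataFrame.
val result = people
  .join(broadcast(departments), Seq("department_id"))
  .select("name", "age", "department_name")
```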
Here is the performance benchmark: https://databricks.com/wp-content/uploads/2015/02/Screen-Shot-2015-02-16-at-9.46.39-AM.png
The overall direction for Spark is to go with DataFrames, so that queries are optimized through Catalyst.