I'm just wondering what is the difference between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]) in Apache Spark?
Can you convert one to the other?
A DataFrame is an RDD that has a schema. You can think of it as a relational database table, in that each column has a name and a known type. The power of DataFrames comes from the fact that, when you create a DataFrame from a structured dataset (JSON, Parquet, ...), Spark is able to infer a schema, either by reading it from the source (Parquet files carry their schema) or by making a pass over the data being loaded (as with JSON). Then, when calculating the execution plan, Spark can use the schema and do substantially better computation optimizations. Note that DataFrame was called SchemaRDD before Spark v1.3.0.
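For example, a minimal sketch (the file path and column names are illustrative assumptions, not from this answer):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("schema-inference").getOrCreate()

// Spark infers the schema of the JSON input (column names and types)
val peopleDF = spark.read.json("people.json")
peopleDF.printSchema()   // e.g. name: string, age: long

// With the schema known, Catalyst can optimize the query plan
peopleDF.filter("age > 21").show()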
Simply put, RDD is the core component, while DataFrame is an API introduced in Spark 1.3.0.

RDD

A collection of data partitions is called an RDD. These RDDs must follow a few properties, such as being immutable, partitioned, and fault-tolerant. Here an RDD is either structured or unstructured.

DataFrame

DataFrame is an API available in Scala, Java, Python and R. It allows you to process any type of structured and semi-structured data. To define it: a DataFrame is a collection of distributed data organized into named columns. You can easily optimize the RDDs underlying a DataFrame, and you can process JSON data, Parquet data and HiveQL data at the same time by using a DataFrame.

Here Sample_DF is considered a DataFrame, and sampleRDD (the raw data) is called an RDD.
Yes, conversion between a DataFrame and an RDD is absolutely possible.

Below are some sample code snippets.

df.rdd is an RDD[Row].

Below are some of the options to create a DataFrame:

1) yourrddOffrow.toDF converts it to a DataFrame.

2) Using createDataFrame of the SQL context:

val df = spark.createDataFrame(rddOfRow, schema)
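A slightly fuller sketch of option 2, with an illustrative two-column schema (the data and column names are assumptions):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("rdd-to-df").getOrCreate()

// An RDD[Row] built from illustrative in-memory data
val rddOfRow = spark.sparkContext
  .parallelize(Seq(("Alice", 30), ("Bob", 25)))
  .map { case (name, age) => Row(name, age) }

// Explicit schema: each column gets a name and a type
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = false)))

val df = spark.createDataFrame(rddOfRow, schema)

// And back again: df.rdd gives you an RDD[Row]
val rowsAgain = df.rdd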
In fact there are now 3 Apache Spark APIs.

RDD API

Example: filter by attribute with an RDD.
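A minimal RDD-style sketch of such a filter (the Person case class and sample data are assumptions for illustration; a SparkSession named spark is assumed to be in scope):

case class Person(name: String, age: Int)

val peopleRdd = spark.sparkContext.parallelize(Seq(
  Person("Alice", 30), Person("Bob", 18)))

// Plain Scala function applied to each element; no schema, so no Catalyst optimization
val adultsRdd = peopleRdd.filter(_.age > 21)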
DataFrame API

Example SQL style:
df.filter("age > 21");
Limitations: Because the code refers to data attributes by name, it is not possible for the compiler to catch any errors. If attribute names are incorrect, the error will only be detected at runtime, when the query plan is created.
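For instance (assuming the df above has an age column), a typo compiles fine but fails later:

// Compiles, but throws an AnalysisException at runtime because "agee" cannot be resolved
df.filter("agee > 21")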
Another downside of the DataFrame API is that it is very Scala-centric and, while it does support Java, the support is limited. For example, when creating a DataFrame from an existing RDD of Java objects, Spark's Catalyst optimizer cannot infer the schema and assumes that any objects in the DataFrame implement the scala.Product interface. Scala case classes work out of the box because they implement this interface.

Dataset API

Example Dataset API SQL style:
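A minimal sketch of the typed Dataset style (reusing the illustrative Person case class from the RDD sketch above, with a SparkSession named spark in scope):

import spark.implicits._

// A strongly typed Dataset[Person]; fields are checked at compile time
val peopleDS = Seq(Person("Alice", 30), Person("Bob", 18)).toDS()

// Typed filter: a typo such as _.agee would be a compile-time error
val adultsDS = peopleDS.filter(_.age > 21)

// SQL-style string expressions still work on a Dataset too
peopleDS.filter("age > 21")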
Evaluating the differences between DataFrame & DataSet:

Further reading... Databricks article
A Dataframe is an RDD of Row objects, each representing a record. A Dataframe also knows the schema (i.e., data fields) of its rows. While Dataframes look like regular RDDs, internally they store data in a more efficient manner, taking advantage of their schema. In addition, they provide new operations not available on RDDs, such as the ability to run SQL queries. Dataframes can be created from external data sources, from the results of queries, or from regular RDDs.
Reference: Zaharia M., et al. Learning Spark (O'Reilly, 2015)
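As a small illustration of the SQL capability mentioned in the quote (the view name and the peopleDF DataFrame are assumptions):

// Register the DataFrame as a temporary view and query it with SQL
peopleDF.createOrReplaceTempView("people")
val adults = spark.sql("SELECT name, age FROM people WHERE age > 21")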
A DataFrame is defined well by a Google search for "DataFrame definition".

So, a DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.

An RDD, on the other hand, is merely a Resilient Distributed Dataset that is more of a black box of data that cannot be optimized, as the operations that can be performed against it are not as constrained.

However, you can go from a DataFrame to an RDD via its rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.

In general it is recommended to use a DataFrame where possible due to the built-in query optimization.
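A minimal round-trip sketch, assuming a SparkSession named spark and an illustrative case class:

import spark.implicits._

case class Person(name: String, age: Int)

val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 30), Person("Bob", 18)))

// RDD -> DataFrame (column names come from the case class fields) via toDF
val df = rdd.toDF()

// DataFrame -> RDD[Row] via the rdd method
val rowsBack = df.rdd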