Difference between DataFrame (in Spark 2.0, i.e. Dataset[Row]) and RDD in Apache Spark

Posted 2019-01-01 07:42

I'm just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]) in Apache Spark?

Can you convert one to the other?

11 Answers
牵手、夕阳
#2 · 2019-01-01 08:30

A DataFrame is an RDD that has a schema. You can think of it as a relational database table, in that each column has a name and a known type. The power of DataFrames comes from the fact that, when you create a DataFrame from a structured dataset (JSON, Parquet, ...), Spark is able to infer a schema by making a pass over the dataset being loaded. Then, when calculating the execution plan, Spark can use the schema to perform substantially better computation optimizations. Note that DataFrame was called SchemaRDD before Spark 1.3.0.
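
As a small sketch (assuming a SparkSession named spark and a hypothetical people.json file), the inferred schema is visible right after loading:

val df = spark.read.json("people.json")  // Spark scans the JSON and infers a schema
df.printSchema()                         // prints the inferred column names and types
df.filter("age > 21").explain()          // Catalyst uses the schema when building the optimized plan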

无色无味的生活
#3 · 2019-01-01 08:32

Simply put, RDD is the core data abstraction, while DataFrame is an API introduced in Spark 1.3.

RDD

An RDD is a collection of data partitions. RDDs have a few key properties:

  • Immutable,
  • Fault Tolerant,
  • Distributed,
  • and more.

An RDD can hold either structured or unstructured data.
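
For illustration only (a sketch assuming a SparkContext named sc; the HDFS path is a placeholder):

val linesRDD = sc.textFile("hdfs://localhost:9000/logs.txt")  // unstructured lines of text
val numbersRDD = sc.parallelize(Seq(1, 2, 3, 4))              // structured in-memory data
val evens = numbersRDD.filter(_ % 2 == 0)                     // lazy transformation
evens.collect()                                               // action, returns Array(2, 4)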

DataFrame

DataFrame is an API available in Scala, Java, Python and R. It lets you process structured and semi-structured data. A DataFrame is a distributed collection of data organized into named columns. Because of the schema, Spark can optimize DataFrame operations far more aggressively than raw RDD operations. You can process JSON data, Parquet data and HiveQL data in the same way using the DataFrame API.

// Note: sqlContext.jsonFile is the old Spark 1.x API; in Spark 2.x use spark.read.json
val sample_DF = sqlContext.read.json("hdfs://localhost:9000/jsondata.json")

// Going the other way, .rdd exposes the underlying RDD of Row objects
val sampleRDD = sample_DF.rdd

Here sample_DF is a DataFrame, while sampleRDD is the underlying RDD[Row] (the raw rows without the DataFrame optimizations).

旧时光的记忆
#4 · 2019-01-01 08:33

First, DataFrame evolved from SchemaRDD.

(The deprecated method toSchemaRDD is a leftover of that history.)

Yes, conversion between DataFrame and RDD is absolutely possible.

Below are some sample code snippets.

  • df.rdd returns an RDD[Row]

Below are some options to create a DataFrame.

  • 1) yourRddOfRow.toDF converts it to a DataFrame.

  • 2) Using createDataFrame of SQLContext / SparkSession:

    val df = spark.createDataFrame(rddOfRow, schema)

where the schema can come from one of the options below, as described in a helpful SO post:

From a Scala case class and the Scala reflection API:

import org.apache.spark.sql.catalyst.ScalaReflection
val schema = ScalaReflection.schemaFor[YourScalacaseClass].dataType.asInstanceOf[StructType]

Or using Encoders:

import org.apache.spark.sql.Encoders
val mySchema = Encoders.product[MyCaseClass].schema

A schema can also be created using StructType and StructField:

val schema = new StructType()
  .add(StructField("id", StringType, true))
  .add(StructField("col1", DoubleType, true))
  .add(StructField("col2", DoubleType, true))
  // ...add further fields as needed


In fact, there are now three Apache Spark APIs:


  1. RDD API

The RDD (Resilient Distributed Dataset) API has been in Spark since the 1.0 release.

The RDD API provides many transformation methods, such as map(), filter(), and reduce() for performing computations on the data. Each of these methods results in a new RDD representing the transformed data. However, these methods are just defining the operations to be performed and the transformations are not performed until an action method is called. Examples of action methods are collect() and saveAsObjectFile().

RDD Example:

rdd.filter(_.age > 21)              // transformation
   .map(_.last)                     // transformation
   .saveAsObjectFile("under21.bin") // action

Example: Filter by attribute with RDD

rdd.filter(_.age > 21)
  2. DataFrame API

Spark 1.3 introduced a new DataFrame API as part of the Project Tungsten initiative, which seeks to improve the performance and scalability of Spark. The DataFrame API introduces the concept of a schema to describe the data, allowing Spark to manage the schema and only pass data between nodes in a much more efficient way than using Java serialization.

The DataFrame API is radically different from the RDD API because it is an API for building a relational query plan that Spark's Catalyst optimizer can then execute. The API is natural for developers who are familiar with building query plans.

Example, SQL style:

df.filter("age > 21");

Limitations: because the code refers to data attributes by name, it is not possible for the compiler to catch errors. If an attribute name is incorrect, the error will only be detected at runtime, when the query plan is created.
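
For example (a sketch, assuming a DataFrame df that has an age column), a misspelled column name compiles but only fails when the plan is analysed:

df.filter("age > 21")   // fine
df.filter("agee > 21")  // compiles, but throws an AnalysisException at runtime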

Another downside of the DataFrame API is that it is very Scala-centric; while it does support Java, the support is limited.

For example, when creating a DataFrame from an existing RDD of Java objects, Spark's Catalyst optimizer cannot infer the schema and assumes that any objects in the DataFrame implement the scala.Product interface. Scala case classes work out of the box because they implement this interface.
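
For instance, a sketch (the Person case class and the SparkSession spark are illustrative assumptions) showing schema inference via reflection over a case class:

case class Person(name: String, age: Int)  // implements scala.Product

import spark.implicits._
val peopleRDD = spark.sparkContext.parallelize(Seq(Person("Ann", 25), Person("Bob", 19)))
val peopleDF = peopleRDD.toDF()            // schema (name: string, age: int) inferred from the case class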

  3. Dataset API

The Dataset API, released as an API preview in Spark 1.6, aims to provide the best of both worlds: the familiar object-oriented programming style and compile-time type safety of the RDD API, combined with the performance benefits of the Catalyst query optimizer. Datasets also use the same efficient off-heap storage mechanism as the DataFrame API.

When it comes to serializing data, the Dataset API has the concept of encoders which translate between JVM representations (objects) and Spark’s internal binary format. Spark has built-in encoders which are very advanced in that they generate byte code to interact with off-heap data and provide on-demand access to individual attributes without having to de-serialize an entire object. Spark does not yet provide an API for implementing custom encoders, but that is planned for a future release.

Additionally, the Dataset API is designed to work equally well with both Java and Scala. When working with Java objects, it is important that they are fully bean-compliant.
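
A brief sketch of the typed API, reusing the illustrative Person case class and SparkSession from the sketch above:

import spark.implicits._

val ds = Seq(Person("Ann", 25), Person("Bob", 19)).toDS()  // Dataset[Person], backed by an encoder
ds.filter(_.age < 21)   // typed lambda: field names and types are checked at compile time
ds.filter($"age" < 21)  // untyped Column expression, as in the DataFrame API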

Example, Dataset API style:

dataset.filter(_.age < 21);

[Image: evaluation differences between DataFrame and Dataset]

Further reading: the Databricks article comparing RDDs, DataFrames and Datasets.

何处买醉
#5 · 2019-01-01 08:34

A DataFrame is an RDD of Row objects, each representing a record. A DataFrame also knows the schema (i.e., data fields) of its rows. While DataFrames look like regular RDDs, internally they store data in a more efficient manner, taking advantage of their schema. In addition, they provide new operations not available on RDDs, such as the ability to run SQL queries. DataFrames can be created from external data sources, from the results of queries, or from regular RDDs.

Reference: Zaharia M., et al. Learning Spark (O'Reilly, 2015)

还给你的自由
#6 · 2019-01-01 08:35

A DataFrame is defined well by a Google search for "DataFrame definition":

A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.

So, a DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.

An RDD, on the other hand, is merely a Resilient Distributed Dataset; it is more of a black box of data that cannot be optimized, because the operations that can be performed against it are not as constrained.

However, you can go from a DataFrame to an RDD via its rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.
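
A minimal sketch of both directions, assuming a SparkSession named spark (column names and data are illustrative):

import spark.implicits._

val df = Seq(("Ann", 25), ("Bob", 19)).toDF("name", "age")     // local data -> DataFrame
val rowRDD = df.rdd                                            // DataFrame -> RDD[Row]
val tupleRDD = rowRDD.map(r => (r.getString(0), r.getInt(1)))  // pull typed fields out of each Row
val df2 = tupleRDD.toDF("name", "age")                         // RDD -> DataFrame again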

In general, it is recommended to use a DataFrame where possible due to the built-in query optimization.
