Overwrite specific partitions in spark dataframe w

2019-01-04 23:49发布

I want to overwrite specific partitions instead of all in spark. I am trying the following command:

df.write.orc('maprfs:///hdfs-base-path','overwrite',partitionBy='col4')

where df is dataframe having the incremental data to be overwritten.

hdfs-base-path contains the master data.

When I try the above command, it deletes all the partitions, and inserts those present in df at the hdfs path.

What my requirement is to overwrite only those partitions present in df at the specified hdfs path. Can someone please help me in this?

9条回答
forever°为你锁心
2楼-- · 2019-01-05 00:00

If you use DataFrame, possibly you want to use Hive table over data. In this case you need just call method

df.write.mode(SaveMode.Overwrite).partitionBy("partition_col").insertInto(table_name)

It'll overwrite partitions that DataFrame contains.

There's not necessity to specify format (orc), because Spark will use Hive table format.

It works fine in Spark version 1.6

查看更多
唯我独甜
3楼-- · 2019-01-05 00:04

I tried below approach to overwrite particular partition in HIVE table.

### load Data and check records
    raw_df = spark.table("test.original")
    raw_df.count()

lets say this table is partitioned based on column : **c_birth_year** and we would like to update the partition for year less than 1925


### Check data in few partitions.
    sample = raw_df.filter(col("c_birth_year") <= 1925).select("c_customer_sk", "c_preferred_cust_flag")
    print "Number of records: ", sample.count()
    sample.show()


### Back-up the partitions before deletion
    raw_df.filter(col("c_birth_year") <= 1925).write.saveAsTable("test.original_bkp", mode = "overwrite")


### UDF : To delete particular partition.
    def delete_part(table, part):
        qry = "ALTER TABLE " + table + " DROP IF EXISTS PARTITION (c_birth_year = " + str(part) + ")"
        spark.sql(qry)


### Delete partitions
    part_df = raw_df.filter(col("c_birth_year") <= 1925).select("c_birth_year").distinct()
    part_list = part_df.rdd.map(lambda x : x[0]).collect()

    table = "test.original"
    for p in part_list:
        delete_part(table, p)


### Do the required Changes to the columns in partitions
    df = spark.table("test.original_bkp")
    newdf = df.withColumn("c_preferred_cust_flag", lit("Y"))
    newdf.select("c_customer_sk", "c_preferred_cust_flag").show()


### Write the Partitions back to Original table
    newdf.write.insertInto("test.original")


### Verify data in Original table
    orginial.filter(col("c_birth_year") <= 1925).select("c_customer_sk", "c_preferred_cust_flag").show()



Hope it helps.

Regards,

Neeraj
查看更多
Bombasti
4楼-- · 2019-01-05 00:08

This is a common problem. The only solution with Spark up to 2.0 is to write directly into the partition directory, e.g.,

df.write.mode(SaveMode.Overwrite).save("/root/path/to/data/partition_col=value")

If you are using Spark prior to 2.0, you'll need to stop Spark from emitting metadata files (because they will break automatic partition discovery) using:

sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

If you are using Spark prior to 1.6.2, you will also need to delete the _SUCCESS file in /root/path/to/data/partition_col=value or its presence will break automatic partition discovery. (I strongly recommend using 1.6.2 or later.)

You can get a few more details about how to manage large partitioned tables from my Spark Summit talk on Bulletproof Jobs.

查看更多
一纸荒年 Trace。
5楼-- · 2019-01-05 00:09

You could do something like this to make the job reentrant (idempotent): (tried this on spark 2.2)

# drop the partition
drop_query = "ALTER TABLE table_name DROP IF EXISTS PARTITION (partition_col='{val}')".format(val=target_partition)
print drop_query
spark.sql(drop_query)

# delete directory
dbutils.fs.rm(<partition_directoy>,recurse=True)

# Load the partition
df.write\
  .partitionBy("partition_col")\
  .saveAsTable(table_name, format = "parquet", mode = "append", path = <path to parquet>)
查看更多
相关推荐>>
6楼-- · 2019-01-05 00:10

Instead of writing to the target table directly, i would suggest you create a temporary table like the target table and insert your data there.

CREATE TABLE tmpTbl LIKE trgtTbl LOCATION '<tmpLocation';

Once the table is created, you would write your data to the tmpLocation

df.write.mode("overwrite").partitionBy("p_col").orc(tmpLocation)

Then you would recover the table partition paths by executing:

MSCK REPAIR TABLE tmpTbl;

Get the partition paths by querying the Hive metadata like:

SHOW PARTITONS tmpTbl;

Delete these partitions from the trgtTbl and move the directories from tmpTbl to trgtTbl

查看更多
手持菜刀,她持情操
7楼-- · 2019-01-05 00:13

As jatin Wrote you can delete paritions from hive and from path and then append data Since I was wasting too much time with it I added the following example for other spark users. I used Scala with spark 2.2.1

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.Path
  import org.apache.spark.SparkConf
  import org.apache.spark.sql.{Column, DataFrame, SaveMode, SparkSession}

  case class DataExample(partition1: Int, partition2: String, someTest: String, id: Int)

 object StackOverflowExample extends App {
//Prepare spark & Data
val sparkConf = new SparkConf()
sparkConf.setMaster(s"local[2]")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val tableName = "my_table"

val partitions1 = List(1, 2)
val partitions2 = List("e1", "e2")
val partitionColumns = List("partition1", "partition2")
val myTablePath = "/tmp/some_example"

val someText = List("text1", "text2")
val ids = (0 until 5).toList

val listData = partitions1.flatMap(p1 => {
  partitions2.flatMap(p2 => {
    someText.flatMap(
      text => {
        ids.map(
          id => DataExample(p1, p2, text, id)
        )
      }
    )
  }
  )
})

val asDataFrame = spark.createDataFrame(listData)

//Delete path function
def deletePath(path: String, recursive: Boolean): Unit = {
  val p = new Path(path)
  val fs = p.getFileSystem(new Configuration())
  fs.delete(p, recursive)
}

def tableOverwrite(df: DataFrame, partitions: List[String], path: String): Unit = {
  if (spark.catalog.tableExists(tableName)) {
    //clean partitions
    val asColumns = partitions.map(c => new Column(c))
    val relevantPartitions = df.select(asColumns: _*).distinct().collect()
    val partitionToRemove = relevantPartitions.map(row => {
      val fields = row.schema.fields
      s"ALTER TABLE ${tableName} DROP IF EXISTS PARTITION " +
        s"${fields.map(field => s"${field.name}='${row.getAs(field.name)}'").mkString("(", ",", ")")} PURGE"
    })

    val cleanFolders = relevantPartitions.map(partition => {
      val fields = partition.schema.fields
      path + fields.map(f => s"${f.name}=${partition.getAs(f.name)}").mkString("/")
    })

    println(s"Going to clean ${partitionToRemove.size} partitions")
    partitionToRemove.foreach(partition => spark.sqlContext.sql(partition))
    cleanFolders.foreach(partition => deletePath(partition, true))
  }
  asDataFrame.write
    .options(Map("path" -> myTablePath))
    .mode(SaveMode.Append)
    .partitionBy(partitionColumns: _*)
    .saveAsTable(tableName)
}

//Now test
tableOverwrite(asDataFrame, partitionColumns, tableName)
spark.sqlContext.sql(s"select * from $tableName").show(1000)
tableOverwrite(asDataFrame, partitionColumns, tableName)

import spark.implicits._

val asLocalSet = spark.sqlContext.sql(s"select * from $tableName").as[DataExample].collect().toSet
if (asLocalSet == listData.toSet) {
  println("Overwrite is working !!!")
}

}

查看更多
登录 后发表回答