Insert data into a Hive table with HiveContext using Spark

Posted 2019-05-10 10:30

I was able to insert data into a Hive table from my spark code using HiveContext like below

   // Spark 1.x: HiveContext gives SQL access to the Hive metastore
   val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
   sqlContext.sql("CREATE TABLE IF NOT EXISTS e360_models.employee(id INT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")

   // Each row is inserted with its own INSERT statement
   sqlContext.sql("insert into table e360_models.employee select t.* from (select 1210, 'rahul', 55) t")
   sqlContext.sql("insert into table e360_models.employee select t.* from (select 1211, 'sriram pv', 35) t")
   sqlContext.sql("insert into table e360_models.employee select t.* from (select 1212, 'gowri', 59) t")

   val result = sqlContext.sql("FROM e360_models.employee SELECT id, name, age")
   result.show()

But this approach creates a separate file in the warehouse directory for every insert, like below:

part-00000
part-00000_copy_1
part-00000_copy_2
part-00000_copy_3

Is there any way to avoid this and just append the new data to a single file, or is there a better way to insert data into Hive from Spark?

1 Answer
Juvenile、少年°
Answered 2019-05-10 10:51

No, there is no way to do that: each new insert creates a new file. This is not a Spark "issue" but the general behavior you will see with Hive too. The only way around it is to perform a single insert with the UNION of all your data; if you need to do multiple inserts, you will end up with multiple files.
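For example, the three single-row inserts in the question can be collapsed into one statement with UNION ALL, so only one part file is written (a sketch reusing the question's table and values):

   // Sketch: one INSERT built from a UNION ALL of all the rows, so Hive
   // writes a single part file instead of one file per insert.
   sqlContext.sql("""
     INSERT INTO TABLE e360_models.employee
     SELECT t.* FROM (
       SELECT 1210, 'rahul', 55
       UNION ALL
       SELECT 1211, 'sriram pv', 35
       UNION ALL
       SELECT 1212, 'gowri', 59
     ) t
   """)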

The only thing you can do is enable file merging in Hive (see the Stack Overflow question "Hive Create Multi small files for each insert in HDFS" and https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties).
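As a sketch, the relevant settings are the standard hive.merge.* properties documented on the Configuration Properties page linked above; whether they actually take effect when issued through HiveContext depends on your Hive and Spark versions:

   // Hedged sketch: standard Hive small-file merge settings. Whether
   // HiveContext honors them depends on the Hive/Spark versions in use.
   sqlContext.sql("SET hive.merge.mapfiles=true")               // merge outputs of map-only jobs
   sqlContext.sql("SET hive.merge.mapredfiles=true")            // merge outputs of map-reduce jobs
   sqlContext.sql("SET hive.merge.smallfiles.avgsize=16000000") // average-size threshold (bytes) that triggers a merge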
