How does updating data in Hive transaction tables work?

Posted 2019-07-29 03:57

Question:

By enabling transactions in Hive, we can update records. Assume I'm using the AVRO format for my Hive table.
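For example, I expect the update path to look roughly like this (the session settings and the table/columns below are placeholders from my reading, not my actual setup):

    -- Session settings I believe are needed before ACID DML (names as in the Hive docs)
    SET hive.support.concurrency=true;
    SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    -- An in-place update against an existing table (hypothetical table and columns)
    UPDATE customer_orders
    SET    order_status = 'SHIPPED'
    WHERE  order_id = 42;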

https://hortonworks.com/hadoop-tutorial/using-hive-acid-transactions-insert-update-delete-data/

How does Hive take care of updating an AVRO file and replicating it again across servers (since the replication factor is 3)?

I could not find a good article that explains this, or the consequences of using ACID in Hive. Since HDFS is recommended for write-once or append-only files, how does updating a record in place work?

Please advise.

Answer 1:

Data for the table is stored in a set of base files. New records, updates, and deletes are stored in delta files. A new set of delta files is created for each transaction (or in the case of streaming agents such as Flume or Storm, each batch of transactions) that alters a table. At read time the reader merges the base and delta files, applying any updates and deletes as it reads.
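As a rough sketch (the database, table, and write IDs below are made up, and exact directory names vary by Hive version), a transactional table plus one insert and one update ends up on HDFS roughly like this:

    -- Transactional tables must be bucketed and stored as ORC (names are illustrative)
    CREATE TABLE sales.customer_orders (
      order_id     INT,
      order_status STRING
    )
    CLUSTERED BY (order_id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    INSERT INTO sales.customer_orders VALUES (42, 'NEW');
    UPDATE sales.customer_orders SET order_status = 'SHIPPED' WHERE order_id = 42;

    -- Resulting layout under the table directory (write IDs illustrative):
    --   delta_0000001_0000001/bucket_00000   <- rows from the INSERT
    --   delta_0000002_0000002/bucket_00000   <- rows from the UPDATE
    -- Readers merge these deltas (and any base_* directory) at query time.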

Periodically, a major compaction merges the accumulated delta files and the base file into a new base file, which speeds up subsequent table scans.
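Compaction normally runs in the background, but it can also be triggered by hand; a minimal sketch (table name is illustrative):

    -- Queue a compaction: 'minor' merges delta files only,
    -- 'major' rewrites the deltas and the base file into a new base
    ALTER TABLE sales.customer_orders COMPACT 'major';

    -- Check queued, running, and completed compactions
    SHOW COMPACTIONS;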

Inserted/updated/deleted data are periodically compacted to save space and optimize data access.
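Automatic compaction is driven by the compactor settings on the metastore side; the sketch below only illustrates the property names (values are illustrative, and these are normally placed in hive-site.xml rather than set per session):

    -- Enable the initiator thread that looks for tables needing compaction
    SET hive.compactor.initiator.on=true;
    -- Number of worker threads that actually run compactions
    SET hive.compactor.worker.threads=1;
    -- Start a minor compaction after this many delta directories accumulate
    SET hive.compactor.delta.num.threshold=10;
    -- Start a major compaction when deltas exceed this fraction of the base size
    SET hive.compactor.delta.pct.threshold=0.1;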

The ACID Transaction feature currently has these limitations:

  1. It only works with ORC files; there is an open JIRA in the open-source community to add support for Parquet tables.
  2. It works only for bucketed, non-sorted tables.
  3. INSERT OVERWRITE is not supported on transactional tables.
  4. BEGIN, COMMIT, and ROLLBACK are not supported; each statement auto-commits (see the sketch after this list).
  5. It is not recommended for OLTP.
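In other words, DML on a transactional table is issued as single auto-committed statements; a rough sketch of what is and is not accepted (table name is illustrative):

    -- Each statement below runs as its own auto-committed transaction
    INSERT INTO sales.customer_orders VALUES (43, 'NEW');
    UPDATE sales.customer_orders SET order_status = 'CANCELLED' WHERE order_id = 43;
    DELETE FROM sales.customer_orders WHERE order_id = 43;

    -- Not available on transactional tables (per the limitations above):
    --   BEGIN; ... COMMIT;                                        -- no explicit transaction control
    --   INSERT OVERWRITE TABLE sales.customer_orders SELECT ...;  -- not supported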

ACID is not supported with AVRO files, and the HDFS block replication policy is the same for ACID tables as for any other files.
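To confirm that replication is unchanged, you can list the base/delta files behind the table from the Hive CLI; a sketch (the warehouse path below is an assumption about a default layout):

    -- Find the table's HDFS location
    DESCRIBE FORMATTED sales.customer_orders;

    -- List its files; the second column of the listing is the HDFS
    -- replication factor (e.g. 3), exactly as for any other HDFS file
    dfs -ls -R /apps/hive/warehouse/sales.db/customer_orders;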

The links below may be helpful for understanding ACID tables in Hive.

http://docs.qubole.com/en/latest/user-guide/hive/use-hive-acid.html

https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions



Tags: hive avro