How to delete and update a record in Hive

Posted 2020-01-25 12:47

I have installed Hadoop, Hive, and Hive JDBC, and they are running fine for me. But I still have a problem: how do I delete or update a single record using Hive? The delete and update commands from MySQL do not work in Hive.

Thanks

hive> delete from student where id=1;
Usage: delete [FILE|JAR|ARCHIVE] <value> [<value>]*
Query returned non-zero code: 1, cause: null

14 answers
祖国的老花朵
Reply #2 · 2020-01-25 13:47

Once you have installed and configured Hive, create a simple table:

hive> create table testTable(id int, name string) row format delimited fields terminated by ',';

Then try to insert a few rows into the test table:

hive> insert into table testTable values (1,'row1'),(2,'row2');

Now try to delete one of the records you just inserted:

hive> delete from testTable where id = 1;

Error! FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.

By default, transactions are turned off. The error means that the configured transaction manager does not support update or delete operations. To support update/delete, you must change the following configuration.

cd  $HIVE_HOME
vi conf/hive-site.xml

Add the properties below to the file:

<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.enforce.bucketing</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>2</value>
</property>
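
It is easy to mistype one of these names, so a quick check of the file before restarting can help. The following is a minimal sketch in Python, not part of Hive: the function name `missing_acid_properties` is hypothetical, and it simply compares the file against the exact names and values listed in this answer (a different worker thread count, for example, would be reported even though it may be perfectly valid).

```python
import xml.etree.ElementTree as ET

# Property names and values copied from this answer; treat them as the
# answer's suggested settings, not as an exhaustive or official list.
REQUIRED = {
    "hive.support.concurrency": "true",
    "hive.enforce.bucketing": "true",
    "hive.exec.dynamic.partition.mode": "nonstrict",
    "hive.txn.manager": "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager",
    "hive.compactor.initiator.on": "true",
    "hive.compactor.worker.threads": "2",
}


def missing_acid_properties(xml_text):
    """Return {name: expected_value} for each required property that is
    absent from the hive-site.xml text or set to a different value."""
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        found[name] = value
    return {k: v for k, v in REQUIRED.items() if found.get(k) != v}
```

To use it, read the contents of conf/hive-site.xml into a string and pass it to the function; an empty result means every property listed here is present with the value shown.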

Restart the service and then try the delete command again:

Error!

FAILED: LockException [Error 10280]: Error communicating with the metastore.

There is a problem with the metastore. To use insert/update/delete operations, you need to add the following setting to conf/hive-site.xml, as the feature is currently in development.

<property>
  <name>hive.in.test</name>
  <value>true</value>
</property>

Restart the service and then run the delete command again:

hive> delete from testTable where id = 1;

Error!

FAILED: SemanticException [Error 10297]: Attempt to do update or delete on table default.testTable that does not use an AcidOutputFormat or is not bucketed.

Only ORC file format is supported in this first release. The feature has been built such that transactions can be used by any storage format that can determine how updates or deletes apply to base records (basically, that has an explicit or implicit row id), but so far the integration work has only been done for ORC.

Tables must be bucketed to make use of these features. Tables in the same system not using transactions and ACID do not need to be bucketed.

The example below creates a table that is stored as ORC, bucketed, and marked with 'transactional'='true'.

hive> create table testTableNew(id int, name string) clustered by (id) into 2 buckets stored as orc TBLPROPERTIES('transactional'='true');

Insert:

hive> insert into table testTableNew values (1,'row1'),(2,'row2'),(3,'row3');

Update:

hive> update testTableNew set name = 'updateRow2' where id = 2;

Delete:

hive> delete from testTableNew where id = 1;

Test:

hive> select * from testTableNew;
再贱就再见
Reply #3 · 2020-01-25 13:48

You should not think of Hive as a regular RDBMS; Hive is better suited for batch processing over very large sets of immutable data.

The following applies to versions prior to Hive 0.14, see the answer by ashtonium for later versions.

There is no supported operation for deleting or updating a particular record or set of records, and to me, needing one is more a sign of a poor schema.

Here is what you can find in the official documentation:

Hadoop is a batch processing system and Hadoop jobs tend to have high latency and
incur substantial overheads in job submission and scheduling. As a result -
latency for Hive queries is generally very high (minutes) even when data sets
involved are very small (say a few hundred megabytes). As a result it cannot be
compared with systems such as Oracle where analyses are conducted on a
significantly smaller amount of data but the analyses proceed much more
iteratively with the response times between iterations being less than a few
minutes. Hive aims to provide acceptable (but not optimal) latency for
interactive data browsing, queries over small data sets or test queries.

Hive is not designed for online transaction processing and does not offer
real-time queries and row level updates. It is best used for batch jobs over
large sets of immutable data (like web logs).

A way to work around this limitation is to use partitions. I don't know what your id corresponds to, but if you're getting different batches of ids separately, you could redesign your table so that it is partitioned by id, and then you could easily drop the partitions for the ids you want to get rid of.
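
As a rough illustration of the partition approach, here is a small Python sketch that builds the `ALTER TABLE ... DROP PARTITION` statements for a list of ids. It assumes a table partitioned by an integer column; the table name `student_by_id` and the helper `drop_partition_statements` are hypothetical and only serve the example. Keep in mind that dropping a partition removes every row stored in it.

```python
def drop_partition_statements(table, partition_column, ids):
    """Build one ALTER TABLE ... DROP PARTITION statement per integer id."""
    statements = []
    for value in ids:
        # int() rejects anything that is not a plain integer id.
        statements.append(
            f"ALTER TABLE {table} DROP IF EXISTS "
            f"PARTITION ({partition_column}={int(value)});"
        )
    return statements


# Hypothetical table and ids, for illustration only.
for stmt in drop_partition_statements("student_by_id", "id", [1, 7]):
    print(stmt)
```

The generated statements can then be pasted at the hive> prompt or saved to a script file and run in one batch.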
