Difference between Hive internal tables and external tables

Posted 2020-01-23 14:58

Can anyone explain the difference between Hive's external and internal tables? I know the difference shows up when dropping the table, but I don't understand what it means that both the data and the metadata are deleted for internal tables while only the metadata is deleted for external tables. Can anyone explain it to me in terms of nodes, please?

17 answers
等我变得足够好
Reply #2 · 2020-01-23 15:41

The only difference in behaviour (as opposed to intended usage) that I have found in my limited research and testing so far (using Hive 1.1.0-cdh5.12.0) is what happens when a table is dropped:

  • the data of the Internal (Managed) tables gets deleted from the HDFS file system
  • while the data of the External tables does NOT get deleted from the HDFS file system.
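
To connect this to where things physically live: table definitions are rows in the metastore's database, while the table's files are stored as HDFS blocks on the DataNodes. The sketch below shows the two kinds of drop side by side. The table names and the path /data/demo_external are hypothetical, and the behaviour noted in the comments assumes default settings.

```sql
-- Managed (internal) table: Hive owns both the metadata and the files.
CREATE TABLE demo_managed (id INT, name STRING);

-- External table: Hive records only the metadata.
-- /data/demo_external is a hypothetical HDFS directory.
CREATE EXTERNAL TABLE demo_external (id INT, name STRING)
LOCATION '/data/demo_external';

-- Removes the metastore entry AND the table's directory in HDFS
-- (moved to the HDFS trash first when trash is enabled).
DROP TABLE demo_managed;

-- Removes only the metastore entry; the files under
-- /data/demo_external stay where they are.
DROP TABLE demo_external;
```

After the second drop, listing the directory with hdfs dfs -ls /data/demo_external should still show the data files.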

(NOTE: See the section 'Managed and External Tables' in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL, which lists some other differences that I did not completely understand.)

I believe Hive chooses the location in which to create the table based on the following precedence, from top to bottom:

  1. Location defined during the Table Creation
  2. Location defined in the Database/Schema Creation in which the table is created.
  3. Default Hive Warehouse Directory (property hive.metastore.warehouse.dir in hive-site.xml)

When the LOCATION option is not given at table creation, the precedence rule above decides where the table goes. This applies to both Internal and External tables, which means an Internal table does not necessarily have to reside in the warehouse directory and can reside anywhere else.
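
A rough way to check this precedence yourself is sketched below. The database name, table names, and paths are all made up for illustration, and it assumes a Hive version like the one tested above that accepts LOCATION on managed tables.

```sql
-- Database with its own location (hypothetical path).
CREATE DATABASE demo_db LOCATION '/data/demo_db';

-- No LOCATION clause: the table directory is expected to be created
-- under the database location rather than under the default warehouse dir.
CREATE TABLE demo_db.t_default (id INT);

-- Explicit LOCATION: takes precedence over the database location.
CREATE TABLE demo_db.t_custom (id INT)
LOCATION '/data/elsewhere/t_custom';

-- The "Location:" and "Table Type:" rows in the output show
-- where each table ended up and whether it is managed or external.
DESCRIBE FORMATTED demo_db.t_default;
DESCRIBE FORMATTED demo_db.t_custom;
```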

Note: I might have missed some scenarios, but based on my limited exploration, the behaviour of Internal and External tables seems to be the same except for the one difference (data deletion) described above. I tried the following scenarios for both Internal and External tables:

  1. Creating table with and without Location option
  2. Creating table with and without Partition Option
  3. Adding new data using the Hive Load and Insert Statements
  4. Adding data files to the table location outside of Hive (using HDFS commands) and refreshing the table using the "MSCK REPAIR TABLE" command
  5. Dropping the tables
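
Scenarios 2 to 4 might look roughly like the following sketch. The table name, column, partition values, and paths are hypothetical. Note that MSCK REPAIR TABLE only matters for partitioned tables, since it registers partition directories that the metastore does not know about yet.

```sql
-- Partitioned external table over a hypothetical directory.
CREATE EXTERNAL TABLE demo_logs (msg STRING)
PARTITIONED BY (dt STRING)
LOCATION '/data/demo_logs';

-- Loading through Hive moves the file into the partition directory
-- and registers the partition in the metastore.
LOAD DATA INPATH '/staging/logs_2020_01_22.txt'
INTO TABLE demo_logs PARTITION (dt = '2020-01-22');

-- Suppose a partition directory was instead created outside of Hive:
--   hdfs dfs -mkdir -p /data/demo_logs/dt=2020-01-23
--   hdfs dfs -put logs.txt /data/demo_logs/dt=2020-01-23/
-- Hive does not see that partition until the metastore is repaired.
MSCK REPAIR TABLE demo_logs;

-- Both partitions should now be listed.
SHOW PARTITIONS demo_logs;
```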
仙女界的扛把子
Reply #3 · 2020-01-23 15:41

An external Hive table has the advantage that its files are not removed when we drop the table. We can also define the row format with different settings, such as SERDE or DELIMITED.

叛逆
Reply #4 · 2020-01-23 15:41

For managed tables, Hive controls the lifecycle of their data. Hive stores the data for managed tables in a sub-directory under the directory defined by hive.metastore.warehouse.dir by default.

When we drop a managed table, Hive deletes the data in the table. But managed tables are less convenient for sharing with other tools. For example, let's say we have data that is created and used primarily by Pig, and we want to run some queries against it without giving Hive ownership of the data.

In that case, we can define an external table that points to that data but doesn't take ownership of it.

狗以群分
Reply #5 · 2020-01-23 15:44

Hive tables can be created as EXTERNAL or INTERNAL. This is a choice that affects how data is loaded, controlled, and managed.

Use EXTERNAL tables when:

  1. The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn't lock the files.
  2. Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
  3. You want to use a custom location such as ASV.
  4. Hive should not own the data or control settings, directories, etc., because another program or process will do those things.
  5. You are not creating a table based on an existing table (AS SELECT).
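
As an illustration of point 2, here is a minimal sketch of two schemas pointed at one data set. The table names, columns, and the /data/events directory are hypothetical, and it assumes the directory holds tab-separated text files.

```sql
-- Both tables read the same hypothetical directory; neither owns it.

-- First attempt at a schema: treat every line as a single string.
CREATE EXTERNAL TABLE events_raw (line STRING)
LOCATION '/data/events';

-- Second schema over the same files: split each line on tabs.
CREATE EXTERNAL TABLE events_parsed (
  event_time STRING,
  user_id    STRING,
  action     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/events';

-- Dropping one schema leaves the files, and the other table, intact.
DROP TABLE events_raw;
```

Had either table been managed, dropping it would have removed the files out from under the other one.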

Use INTERNAL tables when:

  1. The data is temporary.
  2. You want Hive to completely manage the lifecycle of the table and data.

我命由我不由天
Reply #6 · 2020-01-23 15:45

The best use case for an external table in Hive is when you want to create the table from an existing file, either CSV or plain text.
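
For example, something along these lines, assuming comma-separated files with one header row sitting in a hypothetical directory /data/demo_csv (the column names are made up too):

```sql
-- LOCATION must be the directory that contains the CSV files,
-- not the path of a single file.
CREATE EXTERNAL TABLE demo_csv (
  id     INT,
  name   STRING,
  amount DOUBLE
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/demo_csv'
-- Skip the header row of each file when reading.
TBLPROPERTIES ('skip.header.line.count' = '1');

SELECT * FROM demo_csv LIMIT 10;
```

A plain delimiter like this does not understand quoted fields that contain commas; for such files a CSV SerDe would be the more likely fit.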

我只想做你的唯一
Reply #7 · 2020-01-23 15:46

Also keep in mind that Hive is a big data warehouse. When you drop a table, you don't want to lose gigabytes or terabytes of data, because generating, moving, and copying data at that scale can be time-consuming. When you drop a managed table, Hive also trashes its data. When you drop an external table, only the schema definition is removed from the Hive metastore; the data on HDFS remains.
