Confusion with the external tables in hive

2019-08-13 00:56发布

问题:

I have created the hive external table using below command:

 use hive2;

create external table depTable (depId int comment 'This is the unique id for each dep', depName string,location string) comment 'department table' row format delimited fields terminated by "," 
stored as textfile location '/dataDir/';  

Now, when I view the HDFS I can see the db but there is no depTable inside the warehouse.

[cloudera@quickstart ~]$ hadoop fs -ls /user/hive/warehouse/hive2.db
[cloudera@quickstart ~]$ 

Above you can see that there is no table created in this DB. As far as I know, external tables are not stored in the hive warehouse.So am I correct ?? If yes then where is it stored ??
But if I create external table first and then load the data then I am able to see the file inside hive2.db.

hive> create external table depTable (depId int comment 'This is the unique id for each dep', depName string,location string) comment 'department table' row format delimited fields terminated by "," stored as textfile;
OK
Time taken: 0.056 seconds
hive> load data inpath '/dataDir/department_data.txt' into table depTable;
Loading data to table default.deptable
Table default.deptable stats: [numFiles=1, totalSize=90]
OK
Time taken: 0.28 seconds
hive> select * from deptable;
OK
1001    FINANCE SYDNEY
2001    AUDIT   MELBOURNE
3001    MARKETING   PERTH
4001    PRODUCTION  BRISBANE

Now, if I fire the hadoop fs query I can see this table under database as below:

[cloudera@quickstart ~]$ hadoop fs -ls /user/hive/warehouse/hive2.db
Found 1 items
drwxrwxrwx   - cloudera supergroup          0 2019-01-17 09:07 /user/hive/warehouse/hive2.db/deptable  

If I delete the table still I am able to see table in the HDFS as below:

[cloudera@quickstart ~]$ hadoop fs -ls /user/hive/warehouse/hive2.db
Found 1 items
drwxrwxrwx   - cloudera supergroup          0 2019-01-17 09:11 /user/hive/warehouse/hive2.db/deptable  

So, what is the exact behavior of the external tables ?? When I create using LOCATION keyword where does it get stored and when I create using load statement why it is getting stored in the HDFS and after deleting why it doesn't get deleted.

回答1:

The main difference between EXTERNAL and MANAGED tables is in Drop table/partition behavior. When you drop MANAGED table/partition, the location with data files also removed. When you drop EXTERNAL table, the location with data files remains as is.

EXTERNAL table as well as MANAGED is being stored in the location specified in DDL. You can create table on top of existing location with data files already in the location and it will work for both EXTERNAL or MANAGED, does not matter.

You even can create both EXTERNAL and MANAGED tables on top of the same location, see this answer with more details and tests: https://stackoverflow.com/a/54038932/2700344

If you specified location, the data will be stored in that location for both types of tables. If you did not specify location, the data will be in default location: /user/hive/warehouse/database_name.db/table_name for both managed and external tables.

See also official Hive docs on Managed vs External Tables



标签: hadoop hive