Hive External Table vs Internal table commands

Posted 2019-03-01 10:22

Question:

Assuming I have these two tables:

External:

create external table emp_feedback (
  emp_id int,
  emp_name string
)
LOCATION '/user/hive/warehouse/mydb.db/contacts';

Internal:

create table emp_feedback (
  emp_id int,
  emp_name string
);

LOAD DATA INPATH 'file_location_of_csv' INTO TABLE emp_feedback;

  1. When I say LOCATION '/user/hive/warehouse/mydb.db/contacts' for the external table, does that mean the data for that table is found in the directory '/user/hive/warehouse/mydb.db/contacts'? Does that directory have to exist in HDFS beforehand?
  2. Can I use LOAD DATA INPATH... for an external table, or is that only used for internal tables? And vice versa, can I use LOCATION... for an internal table?

Answer 1:

  1. (a) Yes, you are right: it means the data is found in that location/directory.
     (b) No. The directory doesn't have to exist before you create the schema; Hive will create it if it doesn't exist. There is little point in doing so, though, because the table will be empty and your queries will return no rows. Later on, you can move data to that location and use the table.

  2. (a) LOAD DATA INPATH can be used for both external and internal tables. It moves the data to the location specified in the schema (for external tables) or to /.../warehouse/... (for internal tables).
     (b) LOCATION can be specified for both internal and external tables. But when you drop an internal table, Hive also removes the data from that location, whereas for external tables only the metadata is removed.
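
The two points about LOAD DATA INPATH and LOCATION can be sketched roughly as follows. The table names, the comma delimiter, and the paths /data/... and /staging/... are hypothetical and only for illustration. Whether a managed table accepts an explicit LOCATION can also depend on the Hive version and configuration.

```sql
-- Hypothetical external table with an explicit storage directory.
CREATE EXTERNAL TABLE emp_feedback_ext (
  emp_id INT,
  emp_name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/emp_feedback_ext';

-- LOAD DATA INPATH also works for an external table:
-- the HDFS file is moved into the table's LOCATION.
LOAD DATA INPATH '/staging/emp_feedback.csv' INTO TABLE emp_feedback_ext;

-- A managed (internal) table may carry an explicit LOCATION as well.
CREATE TABLE emp_feedback_managed (
  emp_id INT,
  emp_name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/emp_feedback_managed';

-- Inspect where each table stores its data and whether it is
-- a MANAGED_TABLE or an EXTERNAL_TABLE.
DESCRIBE FORMATTED emp_feedback_ext;
DESCRIBE FORMATTED emp_feedback_managed;
```

The Location and Table Type fields in the DESCRIBE FORMATTED output are a quick way to check which kind of table you actually created.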


Answer 2:

The LOAD DATA INPATH command is used to load data into a Hive table. 'LOCAL' signifies that the input file is on the local file system. If 'LOCAL' is omitted, Hive looks for the file in HDFS.

load data inpath '/path/file.csv' into table mytable;
load data local inpath '/Local path/file.csv' into table mytable;

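With an HDFS path, this command moves the file out of the source directory and into the table's directory; with LOCAL, the file is copied instead. It does not create the table, so the target table must already exist.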

The LOCATION keyword allows a table to point to any HDFS location for its storage, rather than being stored in a folder under the directory specified by the configuration property hive.metastore.warehouse.dir.

In other words, with LOCATION '/your path/' specified, Hive does not use the default location for this table. This comes in handy if you already have data generated.
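
As a minimal sketch of the "data already generated" case: the example below assumes comma-delimited text files already sit under the hypothetical HDFS directory /data/contacts. The table name and columns are likewise assumptions for illustration.

```sql
-- Assumes comma-delimited files already exist under /data/contacts
-- (hypothetical path); the table simply points at them.
CREATE EXTERNAL TABLE contacts_ext (
  emp_id INT,
  emp_name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/contacts';

-- No LOAD DATA step is needed: files under LOCATION are read at query time.
SELECT emp_id, emp_name FROM contacts_ext LIMIT 10;
```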

Remember, LOCATION is most often used with EXTERNAL tables, although Hive accepts it for regular (managed) tables as well; without it, the default location is used. You can create an external table and copy the data into its location, so the data is not moved from the source. You can drop the external table and the source data is still available.

When you drop an external table, only the metadata of the Hive table is dropped. The data still exists at the HDFS file location.
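
A rough way to see this behaviour for yourself, assuming default table properties and a hypothetical table over the hypothetical directory /data/feedback_archive:

```sql
-- Hypothetical external table over an existing HDFS directory.
CREATE EXTERNAL TABLE feedback_archive (
  emp_id INT,
  emp_name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/feedback_archive';

-- Drops only the metastore entry; the files under LOCATION stay in HDFS.
DROP TABLE feedback_archive;

-- Recreating the table over the same path makes the same files queryable again.
CREATE EXTERNAL TABLE feedback_archive (
  emp_id INT,
  emp_name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/feedback_archive';

SELECT COUNT(*) FROM feedback_archive;
```

Running the same sequence with a managed table (no EXTERNAL keyword) would instead delete the files on DROP TABLE, which is the difference the answers describe.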

To summarize, LOAD DATA INPATH tells Hive where to look for input files, and the LOCATION keyword tells Hive where the table's data is stored on HDFS.



Tags: hadoop hive