I have log files stored as text in HDFS. When I load the log files into a Hive table, all the files are copied.
Can I avoid having all my text data stored twice?
EDIT: I load it via the following command:
LOAD DATA INPATH '/user/logs/mylogfile' INTO TABLE `sandbox.test` PARTITION (day='20130221')
Then, I can find the exact same file in:
/user/hive/warehouse/sandbox.db/test/day=20130221
I assumed it was copied.
Instead of having your Java application copy the data directly to HDFS, keep the files in the local file system and import them into HDFS via Hive with the following command.
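A sketch of that load statement, assuming the file sits at /path/to/mylogfile on the machine running the Hive client (the local path is illustrative):

-- the local path is illustrative
LOAD DATA LOCAL INPATH '/path/to/mylogfile' INTO TABLE sandbox.test PARTITION (day='20130221');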
Notice the LOCAL keyword: it tells Hive to copy the file from the local file system into its warehouse directory, so only a single copy ends up in HDFS.
use an external table:
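For example, a minimal sketch assuming one raw text column per log line and logs kept under /user/logs (the column name, type, and location are assumptions, not from the question):

-- schema and LOCATION are illustrative
CREATE EXTERNAL TABLE sandbox.test (line STRING)
PARTITIONED BY (day STRING)
LOCATION '/user/logs';

Hive then reads the data in place instead of moving it into its warehouse directory.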
If you want to use partitioning with an external table, you are responsible for managing the partition directories yourself, and the location you specify must be an HDFS directory; each partition has to be registered explicitly with an ALTER TABLE ... ADD PARTITION statement.
If you drop an external table, Hive WILL NOT delete the source data. So if you want to manage your raw files yourself, use external tables; if you want Hive to do it, then let Hive store them inside its warehouse path.
Hive (at least when running in true cluster mode) cannot refer to external files in the local file system, but it can import the files automatically during table creation or a load operation. The reason is that Hive runs MapReduce jobs internally to extract the data, and MapReduce reads from HDFS, writes back to HDFS, and runs in distributed mode, so a file stored in a single machine's local file system cannot be used by the distributed infrastructure.
You can use an ALTER TABLE ... ADD PARTITION statement to register the existing HDFS files as a partition and avoid data duplication.
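A sketch, assuming the log files have been placed in a per-day HDFS directory such as /user/logs/day=20130221 and a partitioned table like the external one above (the directory layout is an assumption; a partition location must be a directory, not a single file):

-- the directory layout is illustrative
ALTER TABLE sandbox.test ADD PARTITION (day='20130221')
LOCATION '/user/logs/day=20130221';

With an external table, dropping the partition later removes only the metadata and leaves the files in place.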