Question:

The file that I am loading is separated by ' ' (white space). Below is the file. The file resides in HDFS:-

1> I am creating an external table and loading the file by issuing the below command:-

CREATE EXTERNAL TABLE IF NOT EXISTS graph_edges (src_node_id STRING COMMENT 'Node ID of Source node', dest_node_id STRING COMMENT 'Node ID of Destination node') ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE LOCATION '/user/hadoop/input';

2> After this, I simply write the contents of the table to another directory by issuing the below command:-

INSERT OVERWRITE DIRECTORY '/user/hadoop/output' SELECT * FROM graph_edges;

3> Now, when I cat the file, the fields are not separated by any delimiter:-

hadoop dfs -cat /user/hadoop/output/000000_0

Output:-

Can somebody please help me out? Why is the delimiter being removed, and how do I delimit the output file?

In the CREATE TABLE command I tried FIELDS TERMINATED BY '\t', but then I get an unnecessary NULL column.

Any pointers would be much appreciated. I am using Hive 0.9.0.

Answer 1:

The problem is that Hive does not allow you to specify the output delimiter for INSERT OVERWRITE DIRECTORY - https://issues.apache.org/jira/browse/HIVE-634

The solution is to create an external table for the output (with a delimiter specification) and insert into that table instead of writing to a directory.

Assuming that you have /user/hadoop/input/graph_edges.csv in HDFS,

hive> create external table graph_edges (src string, dest string) 
    > row format delimited 
    > fields terminated by ' ' 
    > lines terminated by '\n' 
    > stored as textfile location '/user/hadoop/input';

hive> select * from graph_edges;
OK
001 000
001 000
002 001
003 002
004 003
005 004
006 005
007 006
008 007
099 007

hive> create external table graph_out (src string, dest string) 
    > row format delimited 
    > fields terminated by ' ' 
    > lines terminated by '\n' 
    > stored as textfile location '/user/hadoop/output';

hive> insert into table graph_out select * from graph_edges;
hive> select * from graph_out;
OK
001 000
001 000
002 001
003 002
004 003
005 004
006 005
007 006
008 007
099 007

[user@box] hadoop fs -get /user/hadoop/output/000000_0 .

The file comes back as above, with the fields separated by spaces.

Answer 2:

I think you can get the output you want with the concat_ws function:

INSERT OVERWRITE DIRECTORY '/user/hadoop/output' SELECT concat_ws(',', src_node_id, dest_node_id) FROM graph_edges;

Here I have chosen a comma as the column delimiter.

Answer 3:

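While the question is over 2 years old and the top answer was correct at the time, it is now possible to tell Hive to write delimited data to a directory (the Hive language manual documents a ROW FORMAT clause for INSERT OVERWRITE DIRECTORY from Hive 0.11.0 onward).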

Here is an example of outputting the data with the traditional ^A separator:

INSERT OVERWRITE DIRECTORY '/output/data_delimited'
SELECT *
FROM data_schema.data_table;

And now with tab delimiters:

INSERT OVERWRITE DIRECTORY '/output/data_delimited'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
SELECT *
FROM data_schema.data_table;

Answer 4:

I have a slightly different take.

Indeed, Hive does not support a custom delimiter here.

But when you use INSERT OVERWRITE DIRECTORY, there are delimiters in your lines. The delimiter is '\1' (Ctrl-A, byte 0x01).

You can use hadoop dfs -cat $file | head -1 | xxd to find it, or copy the file from HDFS to your local machine and open it with vim. You will see a character displayed as '^A' in vim; that is the delimiter.
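If xxd is not installed, od gives a similar view. Here is a minimal sketch that uses a locally generated sample line in place of real Hive output (the values come from the sample rows in Answer 1, and the variable name dump is only illustrative):

```shell
# Stand-in for one line of Hive output: two fields joined by Ctrl-A (octal 001).
# With real data, replace the printf with: hadoop dfs -cat "$file" | head -1
# -An drops the offset column; -c prints non-printable bytes as 3-digit octal.
dump=$(printf '001\001000\n' | od -An -c)
printf '%s\n' "$dump"
```

The delimiter shows up as a standalone 001 token between the characters of the two fields.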

Back to the question: there is a simple way to solve it.

Still use INSERT OVERWRITE DIRECTORY '/user/hadoop/output' to generate /user/hadoop/output.

Then create an external table whose fields are delimited by '\1':

create external table graph_out (src string, dest string) 
row format delimited 
fields terminated by '\1' 
lines terminated by '\n' 
stored as textfile location '/user/hadoop/output';

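Answer 5:

You can provide a delimiter when writing to directories:

INSERT OVERWRITE DIRECTORY '/user/hadoop/output'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '  -- the delimiter was missing in the original post; a space is assumed, as in the question
SELECT * FROM graph_edges;

This should work for you.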

Answer 6:

The default separator is "^A". In Python, it is "\x01".

When I want to change the delimiter, I use SQL like:

SELECT col1, delimiter, col2, delimiter, col3, ... FROM table

Then, because Hive still writes "^A" on each side of every selected delimiter column, treat "^A" + delimiter + "^A" as the new delimiter.

Answer 7:

I suspect that Hive actually is writing a Control-A as the delimiter, but when you cat the file to the screen it is not visible.

Instead, try opening the file in vi. If you only want to see a little of it, head the file into a local file and open that in vi:

hadoop dfs -cat /user/hadoop/output/000000_0 | head > my_local_file.txt

vi my_local_file.txt

You should be able to see the ^A characters in there.
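If you would rather not open an editor, cat -v prints control characters in caret notation. Here is a minimal sketch that uses a locally generated sample line instead of the real HDFS file (the variable name visible is illustrative; cat -v is widely available on GNU and BSD systems but is not required by POSIX):

```shell
# Stand-in for a line of Hive output; with real data, pipe from:
#   hadoop dfs -cat /user/hadoop/output/000000_0 | head
visible=$(printf '001\001000\n' | cat -v)
printf '%s\n' "$visible"
```

With the sample line this prints 001^A000, which makes the otherwise invisible separator easy to spot.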

Answer 8:

I had this issue where the output of the Hive query needed to be pipe-delimited. Running this sed command replaces ^A with |:

sed 's#\x01#|#g' test.log > piped_test.log
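The \x01 escape in the sed expression may not be understood by non-GNU sed implementations. A portable alternative is tr with an octal escape. Here is a minimal sketch on a locally generated sample line standing in for the contents of test.log (the variable name piped is illustrative):

```shell
# Sample line standing in for test.log: two fields joined by Ctrl-A (octal 001).
# tr maps every Ctrl-A byte to a pipe character.
piped=$(printf '001\001000\n' | tr '\001' '|')
printf '%s\n' "$piped"
```

On a real file the same idea would be: tr '\001' '|' < test.log > piped_test.log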

Answer 9:

This would be a better solution, I suppose, though it is a roundabout way of achieving it.

INSERT OVERWRITE DIRECTORY '/user/hadoop/output' SELECT src_node_id,' ',dest_node_id FROM graph_edges;
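With this approach the output still contains Hive's default Ctrl-A between each of the three selected columns, so a line reads src^A ^Adest. Here is a minimal shell sketch that strips the control bytes so only the space remains. It assumes the default separator and uses a locally generated sample line (the variable names raw and clean are illustrative):

```shell
# Stand-in for one output line of: SELECT src_node_id, ' ', dest_node_id
# Three columns (src, a single space, dest) joined by the default Ctrl-A.
raw=$(printf '001\001 \001000\n')
# Delete every Ctrl-A byte; the literal space column is left as the separator.
clean=$(printf '%s\n' "$raw" | tr -d '\001')
printf '%s\n' "$clean"
```

This only works cleanly if the data itself never contains a Ctrl-A byte.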

Answer 10:

You can use the clause "row format delimited fields terminated by '|'". For example, in your case it would be:

INSERT OVERWRITE DIRECTORY '/user/hadoop/output' row format delimited fields terminated by '|' SELECT * FROM graph_edges;
