I'm creating a new table in Hive using:
CREATE TABLE new_table AS select * from old_table;
My problem is that after the table is created, It generates multiple files for each partition - while I want only one file for each partition.
How can I define it in the table? Thank you!
There are many possible solutions:
1) Add
distribute by partition key
at the end of your query. Maybe there are many partitions per reducer and each reducer creates files for each partition. This may reduce the number of files and memory consumption as well.hive.exec.reducers.bytes.per.reducer
setting will define how much data each reducer will process.2) Simple, quite good if there are not too much data: add
order by
to force single reducer. Or increasehive.exec.reducers.bytes.per.reducer=500000000;
--500M files. This is for single reducer solution is for not too much data, it will run slow if there are a lot of data.If your task is map-only then better consider options 3-5:
3) If running on mapreduce, switch-on merge:
4) When running on Tez
5) For ORC files you can merge files efficiently using this command:
ALTER TABLE T [PARTITION partition_spec] CONCATENATE;
- for ORC