I'm running a Redshift unload command, but am not getting the name I desire. The command is:
UNLOAD ('select * from foo')
TO 's3://mybucket/foo'
CREDENTIALS 'xxxxxx'
GZIP
NULL AS 'NULL'
DELIMITER as '\t'
allowoverwrite
parallel off
The result is mybucket/foo-000.gz. I don't want the slice number at the end of the file name (ideally it would be eliminated completely), and I want to add a file extension at the end of the file name. I'd like to see either of the following:
- mybucket/foo-000.txt.gz
- mybucket/foo.txt.gz
Is there any way to do this (without writing a lambda post process renamer script)?
TL;DR
No.
Explanation:
As the Amazon Redshift UNLOAD documentation says, if you do not want the output split into several parts you can use PARALLEL FALSE, but it is strongly recommended to leave parallel unloading enabled. Even then, the file name will always include the 000.[EXT] suffix (the [EXT] part exists only when compression is enabled), because there is a limit on the size of a file that Redshift can output, which the documentation gives as 6.2 GB.
Therefore, it will always add at least the suffix 000: Redshift doesn't know in advance how large the output will be, so it adds this suffix in case the output reaches 6.2 GB and has to continue in another file.
If you ask why PARALLEL FALSE is not recommended, I'll try to explain it in several points:
When you unload data from Redshift with PARALLEL set to TRUE, it creates at least one file per slice in the cluster (every node you chose for the cluster holds one or more slices). The data is written directly from the data nodes themselves, which is much faster because the writes happen in parallel and skip the leader node.
When you turn this flag off, all data is gathered from all of the data nodes into a single node, the leader node, because it needs to reorganize the sorting of the rows to output and, if requested, compress them as a single stream. This causes your data to be written much more slowly.
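To make the naming concrete, here is a rough Python sketch of what a serial unload is expected to write and what a post-process rename would have to map it to. Everything in it is illustrative: serial_unload_keys and renamed_key are hypothetical helpers, the cap is assumed to be 6.2 * 1024**3 bytes, and the number of parts is only an estimate that assumes each file is filled up to the cap. It does not talk to S3; since S3 has no rename operation, an actual renamer would have to copy each object to the new key and delete the old one.

```python
from __future__ import annotations

import re

# Assumption: the 6.2 GB cap from the documentation, taken as 6.2 * 1024**3 bytes.
MAX_FILE_BYTES = int(6.2 * 1024**3)

# Matches keys such as "foo000.gz", "foo-000.gz" or "foo000" (no compression).
_PART_RE = re.compile(r"^(?P<prefix>.*?)(?P<sep>[-_]?)(?P<part>\d{3})(?P<gz>\.gz)?$")


def serial_unload_keys(prefix: str, total_bytes: int, gzip: bool = False) -> list[str]:
    """Estimate the keys written by an unload with PARALLEL OFF (hypothetical helper)."""
    if total_bytes < 0:
        raise ValueError("total_bytes must not be negative")
    # Ceiling division; assume at least one file, so the 000 suffix is always there.
    parts = max(1, -(-total_bytes // MAX_FILE_BYTES))
    ext = ".gz" if gzip else ""
    return [f"{prefix}{part:03d}{ext}" for part in range(parts)]


def renamed_key(key: str, extension: str = "txt", keep_part: bool = False) -> str:
    """Map an unloaded key to the name the question asks for (hypothetical helper)."""
    match = _PART_RE.match(key)
    if match is None:
        raise ValueError(f"not an unloaded key: {key!r}")
    stem = match["prefix"]
    if keep_part:
        # Keep the part number; needed as soon as there is more than one file.
        stem += match["sep"] + match["part"]
    return f"{stem}.{extension}{match['gz'] or ''}"


if __name__ == "__main__":
    print(serial_unload_keys("foo", 1, gzip=True))        # ['foo000.gz']
    print(renamed_key("foo-000.gz"))                      # foo.txt.gz
    print(renamed_key("foo-000.gz", keep_part=True))      # foo-000.txt.gz
```

Dropping the part number is only safe when the unload produced a single file; with several parts, keep_part=True avoids mapping every part to the same name.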
The COPY and UNLOAD commands work directly with the data nodes, so they behave almost the same way as if you used PARALLEL TRUE. In contrast, queries like SELECT, UPDATE, DELETE and INSERT are processed by the leader node, which is why they suffer from load on the leader node.
A little off topic, but there is no real reason to name the file in a specific way such as "foo.txt.gz": once your file is in the bucket as foo.000, you will most likely either download it by browser, in which case you would set the HTTP headers with the desired name for the end user for that action: