We have numerous files in S3 totaling tens of gigabytes. We need to get them into CSV format; currently the files use delimiters that are not commas. Normally I would do this on a server using sed, but I don't want to have to transfer the files to a server. I want to read directly from S3, translate to CSV line by line, and write the results back to new S3 files.
Glue appears to be able to do this, but the learning curve and setup seem like overkill for such a simple task.
Is there not some easy way to do simple tasks such as this, maybe in EMR or some other AWS tool? We use Athena, and I'm wondering if this could be done with a SQL statement in Athena? Thanks
Yes, that should be very easy, and you don't need any external ETL tool or Glue.
Suppose you have a pipe-delimited table named "cust_transaction_pipe" that is defined over the pipe-delimited files and that you can already query in Athena without any issues (if you haven't registered such a table yet, see the sketch after the query). To convert that table to comma-delimited output, just use the CTAS query below:
create table cust_transaction_csv
with (
    external_location = 's3://YOUR_S3_BUCKET_NAME/cust_tx_csv/',
    format = 'TEXTFILE',
    field_delimiter = ','
)
as
select * from cust_transaction_pipe
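If the source table doesn't exist yet, you can register one over the pipe-delimited files with a plain external table definition. This is only a minimal sketch; the column names and the S3 path are placeholders you would replace with your own schema and location:

create external table cust_transaction_pipe (
    cust_id string,      -- placeholder column
    tx_amount string,    -- placeholder column
    tx_date string       -- placeholder column
)
row format delimited
fields terminated by '|'
location 's3://YOUR_S3_BUCKET_NAME/cust_tx_pipe/'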
Once the CTAS query completes, check the location you specified; you will find one or more comma-delimited files there. You can specify lots of other options inside the WITH () (one example follows below); for the complete set of options, please see the CREATE TABLE AS page in the Athena documentation.
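For instance, the same CTAS can also partition the output. This is only a sketch assuming the source table has a tx_date column to partition on (a hypothetical column name, not something from the question); note that partition columns must come last in the select list:

create table cust_transaction_csv_by_date
with (
    external_location = 's3://YOUR_S3_BUCKET_NAME/cust_tx_csv_by_date/',
    format = 'TEXTFILE',
    field_delimiter = ',',
    partitioned_by = ARRAY['tx_date']
)
as
select cust_id, tx_amount, tx_date   -- partition column listed last
from cust_transaction_pipe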