We have numerous files in S3 totaling tens of gigabytes. We need to get them into CSV format; currently the files use delimiters that are not commas. Normally I would do this on a server using sed, but I don't want to have to transfer the files to a server. I want to read directly from S3, translate to CSV line by line, and write the results back to new S3 files.
Glue appears to be able to do this, but the learning curve and setup seem like overkill for such a simple task.
Is there not some easy way to do simple tasks such as this, maybe in EMR or some other AWS tool? We use Athena, and I'm wondering if this could be done with a SQL statement in Athena? Thanks
Yes, that should be very easy, and you don't need any external ETL tool or Glue.
Suppose you have a pipe-delimited table named "cust_transaction_pipe" that is defined over the pipe-delimited files and that you can already query in Athena without any issues (if you haven't registered such a table yet, see the sketch after the query). To convert that table to comma-delimited output, just use the CTAS query below:
create table cust_transaction_csv
with (
    external_location = 's3://YOUR_S3_BUCKET_NAME/cust_tx_csv/',
    format = 'TEXTFILE',
    field_delimiter = ','
)
as
select * from cust_transaction_pipe
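If the source table doesn't exist yet, you can register one over the pipe-delimited files with a plain external table definition. This is only a minimal sketch; the column names and the S3 path are placeholders you would replace with your own schema and location:

create external table cust_transaction_pipe (
    cust_id string,      -- placeholder column
    tx_amount string,    -- placeholder column
    tx_date string       -- placeholder column
)
row format delimited
fields terminated by '|'
location 's3://YOUR_S3_BUCKET_NAME/cust_tx_pipe/'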
Once the CTAS query completes, check the location you specified; you will find one or more comma-delimited files there. You can specify lots of other options inside the WITH () (one example follows below); for the complete set of options, please see the CREATE TABLE AS page in the Athena documentation.
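For instance, the same CTAS can also partition the output. This is only a sketch assuming the source table has a tx_date column to partition on (a hypothetical column name, not something from the question); note that partition columns must come last in the select list:

create table cust_transaction_csv_by_date
with (
    external_location = 's3://YOUR_S3_BUCKET_NAME/cust_tx_csv_by_date/',
    format = 'TEXTFILE',
    field_delimiter = ',',
    partitioned_by = ARRAY['tx_date']
)
as
select cust_id, tx_amount, tx_date   -- partition column listed last
from cust_transaction_pipe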