Delete S3 files from an AWS Data Pipeline

Posted 2019-07-17 18:46

I would like to ask about a processing task I am trying to complete using a data pipeline in AWS, but I have not been able to get it to work.

Basically, I have 2 data nodes representing 2 MySQL databases, from which data is supposed to be extracted periodically and placed in an S3 bucket. This copy activity is working fine, selecting every day all the rows that have been added since the previous day (today - 1 day).

However, that bucket containing the collected data as CSVs should become the input for an EMR activity, which will process those files and aggregate the information. The problem is that I do not know how to remove, or move to a different bucket, the files that have already been processed, so that I do not have to process all the files every day.

To clarify, I am looking for a way to move or remove already processed files in an S3 bucket from within a pipeline. Can I do that? Is there any other way to process only some of the files in an EMR activity, based on a naming convention or something else?

3 Answers
够拽才男人
Answered 2019-07-17 19:36

Even better, create a Data Pipeline ShellCommandActivity and use the AWS command line tools.

Create a script with these two lines:

    sudo yum -y upgrade aws-cli 
    aws s3 rm $1 --recursive

The first line ensures you have the latest AWS CLI tools.

The second one removes a directory and all of its contents. The $1 is an argument passed to the script.

In your ShellCommandActivity:

    "scriptUri": "s3://myBucket/scripts/theScriptAbove.sh",
    "scriptArgument": "s3://myBucket/myDirectoryToBeDeleted"

The details on how the aws s3 command works are at:

    http://docs.aws.amazon.com/cli/latest/reference/s3/index.html
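For context, a fuller sketch of what the pipeline object might look like, assuming the script above is used (the id, name, and runsOn reference are hypothetical; only scriptUri and scriptArgument come from the snippet above):

```json
{
  "id": "DeleteProcessedFiles",
  "name": "DeleteProcessedFiles",
  "type": "ShellCommandActivity",
  "runsOn": { "ref": "MyEc2Resource" },
  "scriptUri": "s3://myBucket/scripts/theScriptAbove.sh",
  "scriptArgument": "s3://myBucket/myDirectoryToBeDeleted"
}
```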
Answered 2019-07-17 19:37

Another approach, without using EMR, is to install the s3cmd tool through a ShellCommandActivity on a small EC2 instance; you can then use s3cmd in the pipeline to operate on your S3 repo in whatever way you want.

A tricky part of this approach is configuring s3cmd through a configuration file safely (basically passing the access key and secret), since you can't just SSH into the EC2 instance and run 's3cmd --configure' interactively in a pipeline.

To do that, create the config file inside the ShellCommandActivity using 'cat'. For example:

    cat <<EOT >> s3.cfg
    blah
    blah
    blah
    EOT

Then use the '-c' option to attach the config file every time you call s3cmd, like this:

    s3cmd -c s3.cfg ls

It sounds complicated, but it works.
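As a concrete sketch of the heredoc step above: a minimal s3cmd config needs little more than the credential entries (access_key and secret_key under a [default] section are standard s3cmd config keys; the placeholder values here are obviously hypothetical and would come from a secure source in a real pipeline, not be hard-coded):

```shell
#!/bin/sh
# Write a minimal s3cmd config file non-interactively.
# YOUR_ACCESS_KEY / YOUR_SECRET_KEY are placeholders only.
cat <<EOT > s3.cfg
[default]
access_key = YOUR_ACCESS_KEY
secret_key = YOUR_SECRET_KEY
EOT

# Every s3cmd call then points at this file, e.g.:
# s3cmd -c s3.cfg ls
```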

别忘想泡老子
Answered 2019-07-17 19:49

1) Create a script which takes an input path and then deletes the files using hadoop fs -rmr s3path. 2) Upload the script to S3.

In EMR, use a pre-step: 1) hadoop fs -copyToLocal s3://scriptname . 2) chmod +x scriptname 3) run the script

That's pretty much it.
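The script from step 1 could be sketched as a shell function like this (the function name and the DRY_RUN guard are additions for illustration and for testing outside EMR, where the hadoop CLI is not available; on the cluster you would call it with the real S3 path):

```shell
# delete_s3_path: remove an S3 path and its contents via the Hadoop
# filesystem shell. Set DRY_RUN=1 to print the command instead of
# running it (hadoop is only available on the EMR node).
delete_s3_path() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "hadoop fs -rmr $1"
  else
    hadoop fs -rmr "$1"
  fi
}
```

On the EMR node, the pre-step would then run something like `delete_s3_path s3://myBucket/processedFiles` before (or after) the aggregation step.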
