How can I quickly create a mechanism that reads JSON data from Amazon SQS and saves it in Avro files (or another format) in an S3 bucket, partitioned by date and by the value of a given field in the JSON message?
Question:
Answer 1:
You can write an AWS Lambda function that gets triggered by a message being sent to an Amazon SQS queue. You are responsible for writing that code, so how "quickly" you can do it depends on your coding skill.
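As an illustration, here is a minimal sketch of such a Lambda handler in Python. The bucket name, the Avro schema, and the `user_id` partition field are all assumptions made up for the example, and the `fastavro` package would need to be bundled into the deployment package since it is not part of the Lambda runtime:

```python
import io
import json
from datetime import datetime, timezone

import boto3
from fastavro import parse_schema, writer

s3 = boto3.client("s3")

BUCKET = "my-data-bucket"  # hypothetical bucket name
SCHEMA = parse_schema({
    "name": "Event",
    "type": "record",
    "fields": [
        {"name": "user_id", "type": "string"},  # assumed partition field
        {"name": "payload", "type": "string"},
    ],
})

def handler(event, context):
    # An SQS-triggered invocation may carry a batch of records.
    for record in event["Records"]:
        message = json.loads(record["body"])  # assumes the body matches SCHEMA
        date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
        key = (f"events/date={date}/user_id={message['user_id']}/"
               f"{record['messageId']}.avro")

        # Serialize the single message as an Avro file and upload it.
        buf = io.BytesIO()
        writer(buf, SCHEMA, [message])
        s3.put_object(Bucket=BUCKET, Key=key, Body=buf.getvalue())
```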
However, if each message is processed individually, you will end up with one Amazon S3 object per SQS message, which is quite inefficient to process. Because each file will be tiny, the Avro format buys you little, and the per-file overhead adds up quickly when processing the data downstream.
An alternative could be to send the messages to an Amazon Kinesis Data Firehose delivery stream, which can buffer records by size (eg every 5MB) or time (eg every 5 minutes) before delivering them to S3. This will result in fewer, larger objects in S3, but they will not be partitioned by your field, nor in Avro format.
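As a sketch of that approach, the loop below drains an SQS queue and forwards each message body into a Firehose delivery stream. The queue URL and stream name are hypothetical, and the delivery stream itself (with its S3 destination and buffering hints) is assumed to already exist:

```python
import boto3

sqs = boto3.client("sqs")
firehose = boto3.client("firehose")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical
STREAM_NAME = "my-delivery-stream"  # hypothetical

while True:
    # Long-poll SQS for up to 10 messages at a time.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        # Firehose accumulates records into buffered batches before
        # writing them to S3, so we just hand each message over.
        firehose.put_record(
            DeliveryStreamName=STREAM_NAME,
            Record={"Data": (msg["Body"] + "\n").encode("utf-8")},
        )
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```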
To get the best performance out of a format like Avro, combine the data into larger files that are more efficient to process. So, for example, you could use Kinesis Data Firehose for collecting the data, then a daily Amazon EMR job to compact those files into partitioned Avro files.
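Such a compaction job could be a small PySpark script run daily on EMR. This is a sketch under assumptions: the input and output prefixes, the `timestamp` field, and the `user_id` partition field are placeholders, and the spark-avro package must be on the classpath (eg via `--packages org.apache.spark:spark-avro_2.12:<version>`):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("compact-to-avro").getOrCreate()

# Read the small raw JSON objects accumulated by Firehose.
df = spark.read.json("s3://my-data-bucket/raw/")  # hypothetical input prefix

# Derive the date partition and write larger, partitioned Avro files.
(df.withColumn("date", F.to_date(F.col("timestamp")))  # assumes a timestamp field
   .write
   .partitionBy("date", "user_id")                     # assumed partition field
   .format("avro")
   .mode("overwrite")
   .save("s3://my-data-bucket/avro/"))                 # hypothetical output prefix
```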
So, the answer is: "It's pretty easy, but you probably don't want to do it."
Your question does not define how the data gets into SQS. If, rather than processing messages as soon as they arrive, you are willing to let the data accumulate in SQS for some period of time (eg 1 hour or 1 day), you could then write a program that reads all of the messages and writes them out as partitioned Avro files. This uses SQS as a temporary holding area, allowing data to accumulate before being processed, but it sacrifices any real-time reporting.
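Here is a minimal sketch of such a drain-and-write program, assuming JSON messages with a `user_id` field to partition on; the queue URL, schema, and output layout are hypothetical:

```python
import json
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import Path

import boto3
from fastavro import parse_schema, writer

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical
SCHEMA = parse_schema({
    "name": "Event",
    "type": "record",
    "fields": [
        {"name": "user_id", "type": "string"},  # assumed partition field
        {"name": "payload", "type": "string"},
    ],
})

sqs = boto3.client("sqs")
groups = defaultdict(list)  # (date, user_id) -> list of records
date = datetime.now(timezone.utc).strftime("%Y-%m-%d")

# Drain the queue: long-poll until no messages remain.
while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=5
    )
    messages = resp.get("Messages", [])
    if not messages:
        break
    for msg in messages:
        record = json.loads(msg["Body"])
        groups[(date, record["user_id"])].append(record)
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

# Write one Avro file per (date, field-value) partition.
for (d, uid), records in groups.items():
    out = Path(f"out/date={d}/user_id={uid}")
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "part-0.avro", "wb") as f:
        writer(f, SCHEMA, records)
```

The local `out/` tree mirrors the `date=.../user_id=...` prefix layout, so the resulting files could simply be uploaded to S3 under the same keys.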