I have a Spark batch job which is executed hourly. Each run generates and stores new data in S3 with the directory naming pattern DATA/YEAR=?/MONTH=?/DATE=?/datafile.

After uploading the data to S3, I want to investigate it using Athena. Furthermore, I would like to visualize it in QuickSight by connecting to Athena as a data source.

The problem is that after each run of my Spark batch job, the newly generated data stored in S3 is not discovered by Athena unless I manually run the query MSCK REPAIR TABLE.
Is there a way to make Athena update the data automatically, so that I can create a fully automatic data visualization pipeline?
You should be running ADD PARTITION instead, which adds the newly created partition from your S3 location.
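For example (a sketch only; the table name follows the rest of this answer, while the bucket and the partition values are placeholders you would fill in from the current run):

ALTER TABLE some_database.some_table ADD IF NOT EXISTS
  PARTITION (year = '2019', month = '02', `date` = '28')
  LOCATION 's3://your-bucket/DATA/YEAR=2019/MONTH=02/DATE=28/';

This assumes the table was declared with year, month, and date as its partition keys; the backticks around `date` guard against it being treated as a reserved word in the DDL.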
Athena leverages Hive for partitioning data. To create a table with partitions, you must define it during the CREATE TABLE statement. Use PARTITIONED BY to define the keys by which to partition data.

There are a number of ways to schedule this task. How do you schedule your workflows? Do you use a system like Airflow, Luigi, Azkaban, cron, or AWS Data Pipeline?
From any of these, you should be able to fire off the following CLI command.
$ aws athena start-query-execution --query-string "MSCK REPAIR TABLE some_database.some_table" --result-configuration "OutputLocation=s3://SOMEPLACE"
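With plain cron, for instance, you could rebuild the partitions a few minutes after each hourly Spark run; a sketch (the five-minute offset and the output location are placeholders):

# min hour day-of-month month day-of-week  command
5 * * * * aws athena start-query-execution --query-string "MSCK REPAIR TABLE some_database.some_table" --result-configuration "OutputLocation=s3://SOMEPLACE"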
Another option would be AWS Lambda. You could have a function that calls MSCK REPAIR TABLE some_database.some_table in response to a new upload to S3.

An example Lambda function could be written as such:
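(A rough sketch, assuming a Python runtime with boto3, which Lambda provides out of the box; the database, table, and output location are placeholders.)

import boto3

# Placeholders -- substitute your own database, table, and S3 output location.
DATABASE = 'some_database'
TABLE = 'some_table'
OUTPUT_LOCATION = 's3://SOMEPLACE'

athena = boto3.client('athena')

def lambda_handler(event, context):
    # Triggered by the S3 ObjectCreated event: rebuild the partition metadata
    # so that Athena picks up the files the Spark job just wrote.
    response = athena.start_query_execution(
        QueryString='MSCK REPAIR TABLE {}.{}'.format(DATABASE, TABLE),
        QueryExecutionContext={'Database': DATABASE},
        ResultConfiguration={'OutputLocation': OUTPUT_LOCATION},
    )
    return response['QueryExecutionId']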
You would then configure a trigger to execute your Lambda function when new data are added under the DATA/ prefix in your bucket (a rough sketch of that wiring is at the end of this answer).

Ultimately, explicitly rebuilding the partitions after you run your Spark job using a job scheduler has the advantage of being self-documenting. On the other hand, AWS Lambda is convenient for jobs like this one.
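For reference, the trigger itself can be set up from the Lambda console, or roughly like this from the CLI (a sketch; the bucket name, region, account ID, and function name are all placeholders):

# Let S3 invoke the function.
aws lambda add-permission \
  --function-name repair-athena-partitions \
  --statement-id s3-invoke \
  --action lambda:InvokeFunction \
  --principal s3.amazonaws.com \
  --source-arn arn:aws:s3:::your-bucket

# Invoke the function for every object created under the DATA/ prefix.
aws s3api put-bucket-notification-configuration \
  --bucket your-bucket \
  --notification-configuration '{
    "LambdaFunctionConfigurations": [{
      "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:repair-athena-partitions",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "DATA/"}]}}
    }]
  }'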