I have a Google Analytics (GA) account that tracks user activity in an app. I have set up BigQuery so that I can access the raw GA data, which is exported from GA to BigQuery daily.
I have a Python app that queries the BigQuery API programmatically. The app returns the expected response for each query I run.
My next step is to pull this data from BigQuery and load it into a Hadoop cluster, ideally as a Hive table. I would like to build an ETL process around the Python app. For example, a daily ETL job would run the Python app and export the results to the cluster.
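To make the daily step concrete, here is a minimal sketch of the shape I have in mind. It is not a working pipeline: `fetch_rows` is a hypothetical stand-in for whatever my existing Python app does to query BigQuery, and the `ga_sessions_YYYYMMDD` table naming, the `dt=` partition column, the file name, and the HDFS root path are all assumptions. The sketch only writes one date-partitioned, newline-delimited JSON file locally and builds (but does not run) the `hdfs dfs` commands that would copy it to the cluster.

```python
import datetime as dt
import json
import pathlib
import tempfile


def ga_table_name(day: dt.date) -> str:
    """Name of the daily GA export table (assumed ga_sessions_YYYYMMDD layout)."""
    return f"ga_sessions_{day:%Y%m%d}"


def partition_dir(root: pathlib.Path, day: dt.date) -> pathlib.Path:
    """Hive-style partition directory, e.g. <root>/dt=2024-01-31 (assumed column name)."""
    return root / f"dt={day:%Y-%m-%d}"


def export_day(fetch_rows, root, day: dt.date) -> pathlib.Path:
    """Write one day's rows as newline-delimited JSON.

    fetch_rows is a hypothetical callable: it takes a table name and
    returns an iterable of dicts (this is where the BigQuery app plugs in).
    Re-running for the same day overwrites the same file, so the step is idempotent.
    """
    out_dir = partition_dir(pathlib.Path(root), day)
    out_dir.mkdir(parents=True, exist_ok=True)
    final = out_dir / "part-00000.json"
    tmp = out_dir / "part-00000.json.tmp"
    with tmp.open("w", encoding="utf-8") as fh:
        for row in fetch_rows(ga_table_name(day)):
            fh.write(json.dumps(row, sort_keys=True) + "\n")
    tmp.replace(final)  # rename only after the whole file has been written
    return final


def hdfs_commands(local_file, hdfs_root: str, day: dt.date):
    """Build the shell commands to upload the file; nothing is executed here."""
    target = f"{hdfs_root.rstrip('/')}/dt={day:%Y-%m-%d}"
    return [
        ["hdfs", "dfs", "-mkdir", "-p", target],
        ["hdfs", "dfs", "-put", "-f", str(local_file), target + "/"],
    ]
```

The commands returned by `hdfs_commands` could be passed to `subprocess.run(cmd, check=True)` on a machine with a Hadoop client. A Hive external table partitioned by `dt` could then point at the HDFS root, with each new partition registered after upload.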
Eventually, this ETL process should be scheduled through Jenkins and should be able to run on production systems.
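For the Jenkins part, I assume the job would simply invoke the script and rely on its exit code to mark the build as passed or failed. Below is a rough sketch of such an entry point. `run_etl` is a hypothetical placeholder for the actual export, and the `--date` flag (defaulting to yesterday, so that a failed day can be re-run by hand) is my own assumption, not something that exists yet.

```python
import argparse
import datetime as dt
import logging


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Daily BigQuery to Hadoop export")
    parser.add_argument(
        "--date",
        type=dt.date.fromisoformat,  # expects YYYY-MM-DD
        default=None,
        help="day to export; defaults to yesterday",
    )
    return parser.parse_args(argv)


def resolve_day(args, today=None):
    """Use the explicit --date if given, otherwise the day before 'today'."""
    today = today or dt.date.today()
    return args.date or today - dt.timedelta(days=1)


def run_etl(day):
    """Hypothetical placeholder: query BigQuery and load the cluster for 'day'."""
    logging.info("would export data for %s", day)


def main(argv=None, etl=run_etl):
    """Return 0 on success and 1 on failure so the scheduler can detect errors."""
    logging.basicConfig(level=logging.INFO)
    day = resolve_day(parse_args(argv))
    try:
        etl(day)
    except Exception:
        logging.exception("ETL failed for %s", day)
        return 1
    return 0
```

Run as a script, this would end with `sys.exit(main())` so that a non-zero return value fails the Jenkins build step.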
What architecture, design, and general factors should I consider when planning this ETL process?
Any suggestions on how I should go about this? I am interested in the simplest viable approach.
Thanks in advance.