I have a Google Analytics (GA) account that tracks user activity in an app. I have BigQuery set up so that I can access the raw GA data, which is exported from GA to BigQuery on a daily basis.
I have a Python app that queries the BigQuery API programmatically, and it returns the expected results for whatever I query.
My next step is to get this data out of BigQuery and into a Hadoop cluster, ideally as a Hive table. I want to build something like an ETL process around the Python app: on a daily basis, the ETL process would run the app and export the resulting data to the cluster, roughly as sketched below.
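To make this concrete, here is a minimal sketch of the daily step I have in mind. It assumes the google-cloud-bigquery Python client and the hdfs CLI are available on the box running the job; the project, dataset, column, and path names are placeholders for my real ones.

```python
# Sketch of a daily export: BigQuery -> local CSV -> HDFS.
# Assumes google-cloud-bigquery is installed, credentials are configured,
# and the hdfs CLI is on the PATH. All names below are placeholders.
import csv
import subprocess
from datetime import date, timedelta

from google.cloud import bigquery


def run_daily_export(run_date: date) -> None:
    suffix = run_date.strftime("%Y%m%d")
    local_file = f"/tmp/ga_export_{suffix}.csv"
    hdfs_dir = "/data/ga_daily"  # placeholder HDFS landing directory

    # 1. Query that day's GA export table in BigQuery.
    client = bigquery.Client()
    query = f"""
        SELECT fullVisitorId, visitStartTime, totals.pageviews AS pageviews
        FROM `my-project.my_ga_dataset.ga_sessions_{suffix}`
    """  # placeholder project/dataset; columns depend on what I need
    rows = client.query(query).result()

    # 2. Dump the result to a local CSV file.
    with open(local_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["fullVisitorId", "visitStartTime", "pageviews"])
        for row in rows:
            writer.writerow([row["fullVisitorId"], row["visitStartTime"], row["pageviews"]])

    # 3. Push the file into HDFS so a Hive external table over hdfs_dir sees it.
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir], check=True)


if __name__ == "__main__":
    run_daily_export(date.today() - timedelta(days=1))
```

The idea is that a Hive external table defined over that HDFS directory would then expose each day's file.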
Eventually, this ETL process should run from Jenkins on production systems.
What architecture/design/general factors would I need to consider while planning for this ETL process?
Any suggestions on how I should go about this? I am interested in doing this in the simplest, most viable way.
Thanks in advance.
Check out Apache Oozie. It seems to fit your requirements: it has a workflow engine, scheduling support, and built-in actions for shell scripts and Hive.
In terms of installation and deployment, it usually comes as part of a Hadoop distribution, but it can be installed separately. It depends on a database as its persistence layer, which may require some extra effort to set up.
It has a web UI and a REST API, so managing and monitoring jobs can be automated if desired.
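For example, a Jenkins job could poll the Oozie REST API to keep an eye on workflows. A rough sketch in Python (the host, port, and job id are placeholders, and the exact endpoint paths may differ between Oozie versions):

```python
# Minimal sketch of polling the Oozie REST API with the requests library.
# OOZIE_URL and the job ids are placeholders; adjust to your Oozie server.
import requests

OOZIE_URL = "http://oozie-host:11000/oozie"  # placeholder Oozie server


def get_job_status(job_id: str) -> str:
    # /v2/job/<id>?show=info returns job metadata as JSON, including status
    # (RUNNING, SUCCEEDED, KILLED, ...).
    resp = requests.get(f"{OOZIE_URL}/v2/job/{job_id}", params={"show": "info"})
    resp.raise_for_status()
    return resp.json()["status"]


def list_running_workflows() -> list:
    # /v2/jobs with a filter lists workflow jobs matching that filter.
    resp = requests.get(
        f"{OOZIE_URL}/v2/jobs",
        params={"jobtype": "wf", "filter": "status=RUNNING"},
    )
    resp.raise_for_status()
    return [job["id"] for job in resp.json()["workflows"]]


if __name__ == "__main__":
    for job_id in list_running_workflows():
        print(job_id, get_job_status(job_id))
```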
The easiest way to go from BigQuery to Hadoop is to use the official Google BigQuery Connector for Hadoop:
https://cloud.google.com/hadoop/bigquery-connector
This connector defines a BigQueryInputFormat class. (It uses Google Cloud Storage as an intermediary between BigQuery's data and the splits that Hadoop consumes.)
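Since the rest of your pipeline is Python, one way to drive it is through PySpark's newAPIHadoopRDD. A rough sketch, assuming the connector jar is available on the cluster, the JSON-text flavor of the input format, and placeholder project/dataset/bucket names:

```python
# Sketch of reading a BigQuery table through the Hadoop connector from PySpark.
# Assumes the BigQuery connector jar is on the cluster classpath; all project,
# dataset, table, and bucket names below are placeholders.
from pyspark import SparkContext

sc = SparkContext()

conf = {
    # Project and GCS bucket used for the connector's temporary export data.
    "mapred.bq.project.id": "my-project",
    "mapred.bq.gcs.bucket": "my-temp-bucket",
    "mapred.bq.temp.gcs.path": "gs://my-temp-bucket/bq_export_tmp",
    # Source table to read (here, a daily GA export table).
    "mapred.bq.input.project.id": "my-project",
    "mapred.bq.input.dataset.id": "my_ga_dataset",
    "mapred.bq.input.table.id": "ga_sessions_20160101",
}

# Each record comes back as a (record index, JSON string of the row) pair.
table_data = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf,
)

print(table_data.take(5))
```

From there you can transform the RDD and write it out to HDFS or directly into a Hive table.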