Data Governance solution for Databricks, Synapse a

Posted 2020-06-28 01:15

Question:

I'm new to data governance, so forgive me if the question lacks some information.

Objective

We're building a data lake and an enterprise data warehouse from scratch for a mid-size telecom company on the Azure platform. We're using ADLS gen2, Databricks and Synapse for our ETL processing, data science, ML & QA activities.

We already have about a hundred input tables and ingest about 25 TB per year. We expect more in the future.

The business strongly prefers cloud-agnostic solutions. Still, they are okay with Databricks since it's available on both AWS and Azure.

Question

What is the best Data Governance solution for our stack and requirements?

My workarounds

I haven't used any data governance solutions yet. I like the AWS Data Lake solution, since it provides basic functionality out of the box. AFAIK, Azure Data Catalog is outdated, because it doesn't support ADLS gen2.

After some very quick googling, I found three options:

  1. Databricks Privacera
  2. Databricks Immuta
  3. Apache Ranger & Apache Atlas.

Currently I'm not even sure whether the 3rd option fully supports our Azure stack. Moreover, it would require much more development (infrastructure definition) effort. So are there any reasons I should look in the Ranger/Atlas direction?

What are the reasons to prefer Privacera over Immuta and vice versa?

Are there any other options I should evaluate?

What is already done

From a data governance perspective, we have done only the following things:

  1. Defined data zones inside ADLS.
  2. Applied encryption/obfuscation to sensitive data (due to GDPR requirements).
  3. Implemented Row-Level Security (RLS) at the Synapse and Power BI layers.
  4. Built a custom audit framework for logging what was persisted and when.

Things to be done

  1. Data lineage and a single source of truth. Only 4 months in, understanding the dependencies between data sets has already become a pain point. The lineage information is stored in Confluence, where it is hard to maintain and has to be updated in multiple places. It is already outdated in some places.
  2. Security. Business users may do some data exploration in Databricks Notebooks in the future. We need RLS for Databricks.
  3. Data Life Cycle management.
  4. Maybe other data governance related stuff, such as data quality, etc.
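
To make item 1 more concrete, here is a minimal sketch of the kind of dependency lookup we currently do by hand from Confluence pages. It is only an illustration under assumptions: the dataset names are made up, and the `upstream` and `downstream` helpers are hypothetical and not part of Privacera, Immuta, Ranger or Atlas.

```python
from collections import defaultdict, deque

# Hypothetical dataset names; each output dataset maps to its direct inputs.
LINEAGE = {
    "curated.customers": ["raw.crm_customers"],
    "curated.usage": ["raw.cdr_events", "curated.customers"],
    "dwh.monthly_revenue": ["curated.usage", "raw.billing"],
}


def upstream(dataset, lineage):
    """Return every dataset that `dataset` depends on, directly or transitively."""
    seen = set()
    queue = deque(lineage.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.add(current)
        queue.extend(lineage.get(current, []))
    return seen


def downstream(dataset, lineage):
    """Return every dataset that would be affected by a change to `dataset`."""
    # Invert the mapping (input -> outputs) and reuse the same traversal.
    children = defaultdict(list)
    for output, inputs in lineage.items():
        for source in inputs:
            children[source].append(output)
    return upstream(dataset, children)


if __name__ == "__main__":
    print(sorted(upstream("dwh.monthly_revenue", LINEAGE)))
    print(sorted(downstream("raw.crm_customers", LINEAGE)))
```

A governance tool would presumably capture this mapping automatically from the ETL jobs instead of having us maintain it by hand, which is the part we are missing today.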

Answer 1:

To better understand option #2 that you cited for data governance on Azure, here is a how-to tutorial demonstrating the experience of applying RLS on Databricks; a related Databricks video demo; and other data governance tutorials.

Full disclosure: My team produces content for data engineers at Immuta and I hope this helps save you some time in your research.