Building a Data Mesh Architecture in Azure – Part 6

Data Mesh vs Azure – Theory vs Practice


Use the tag Data Mesh vs Azure to follow this blog series.

As a reminder, the four data mesh principles:

  1. domain-oriented decentralised data ownership and architecture.
  2. data as a product.
  3. self-serve data infrastructure as a platform.
  4. federated computational governance.

Source reference: https://martinfowler.com/articles/data-mesh-principles.html


A Quick Recap

In Part 1, Part 2 and Part 3 of this blog series we defined our Azure data mesh nodes and edges. The current conclusion is that Azure Resource Groups can house our data product nodes within the mesh, and for our edges (interfaces) we’ve established the following working definitions and potential Azure resources:

  • Primary – data integration/exchange. Provided by a SQL Endpoint.
  • Secondary – operational reporting and logging. Provided by Azure Log Analytics as a minimum.
  • Tertiary – resource (network) connectivity. Delivered using Azure VNet Peering in a hub/spoke setup.

Then in Part 4 of the series we explored and concluded how we could template and control the deployment of our nodes using Azure DevOps, Azure Bicep and VS Code.

In Part 5, we defined the hierarchy of data domains vs data products and aligned this thinking to Azure Subscriptions vs Azure Resource Groups.

Now in Part 6, I want to bring into focus an area of the data mesh architecture that I think many have been struggling with and that doesn’t yet have a solid technical solution in terms of practical implementation. This partly relates to the fourth principle…


4. Federated Computational Governance

To break down this principle I want to consider, very simply, the input, process and output of the products within our data mesh. We could also describe this as extract, transform and load if you like. For any given use case where the goal is data insight, this is well established. However, the danger I see in delivering this using decentralised processing principles is that we create silos of data processing and siloed data outputs. Silos that mean we lose the natural data insights found when we blend disparate data sources via a business-focused data model. To address this head on as a problem statement:

Problem

How can we offer a self-service endpoint/business focused data model to allow natural data exploration and unlock data insight in a decentralised data mesh architecture?

I touched on this problem in the first part of this blog series, where I articulated the concern as follows:

If we deliver a self-serve data infrastructure, what data model should be targeted and where should it be built (without it becoming a monolith data warehouse amongst the mesh). Should this be part of the suggested virtualisation layer? Or is a single data model not the goal at the data product level?

Building a Data Mesh Architecture in Azure – Part 1

The data mesh theory suggests that data modelling should be left to the data domains rather than trying to create a complete canonical model for the enterprise. Fine. But. In the context of our data domains and data products, there is a need to go beyond the initial (domain local) data models. Something more is required here. A semantic layer for business users: people who aren’t technical and might not have domain knowledge, but just want to explore. Famously described by Microsoft’s Christian Wade as “clicky, clicky, draggy, droppy.” 🙂

The shortcut to this problem might be a technical solution that bolts on a semantic layer that brings together all data product outputs into a business-wide data model. Crudely drawn below.

So, in practice, does this mean we need to go against the decentralised mesh theory? Because in reality, we want a centralised canonical model. That said, can we still honour the decentralised principles without creating a monolithic data warehouse on the side of our mesh?

Solution (Current thinking)

Continuing the theme of this series, I want to move from theory to practice and consider how we deliver this requirement.

To start, we can describe the existence of these technical output components within the data mesh as multi-plane components. Multi-plane components that sit across data domains and span our established interface planes to support such a canonical model, or many models.

These multi-plane components exist to deliver (unlock) self-service, data exploration and data insight from blended domain sources.

In all cases, these multi-plane output components should offer the following characteristics (a rough sketch of what the first two could look like in practice follows this list):

  • Not persist (copy) datasets into an output focused data store.
  • Not allow write back access to the underlying data source(s).
  • Inherit data governance rules from the data domains.
  • Support querying from non-technical business user personas.
  • Perform well, without disruptive amounts of data refresh latency.
  • Offer sharing of outputs with other users.
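
To make the first two bullets a little more concrete, here is a minimal sketch, assuming a Synapse serverless SQL database and an entirely hypothetical storage path, schema and view name. A serving view is defined directly over a domain’s published Delta output; serverless SQL reads the files in place rather than copying them, and the endpoint is read-only over the lake.

```sql
-- Minimal sketch only: the database, schema, view name and storage path
-- are hypothetical. A Synapse serverless SQL view defined directly over a
-- data product's published Delta output.
CREATE SCHEMA serving;
GO

-- No data is persisted in the serving layer; the view reads the Delta
-- files in place and the serverless endpoint is read-only over the lake.
CREATE VIEW serving.SalesOrders
AS
SELECT *
FROM OPENROWSET(
        BULK 'https://salesdomainlake.dfs.core.windows.net/product/orders/',
        FORMAT = 'DELTA'
     ) AS orders;
GO
```

The view itself does nothing for governance inheritance or performance; those still depend on how the underlying domain secures and lays out its files.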

When working on the Microsoft cloud platform, the next obvious question becomes: what Azure Resource should we use? Spoiler alert, as we sit here and now, I’m not sure! I see two possible technology choices (that we can argue about):

  • Azure Synapse Analytics – created using a flavour of SQL compute (ideally serverless) to abstract domain local data via external tables into a new multi-plane serving layer model (a rough sketch follows this list).
  • Analysis Services – intentionally dropping the ‘Azure’ prefix as this Resource could exist in Azure or as part of the Power Platform stack. In either case, delivering a model in DirectQuery mode that abstracts domain local data into a nicely formed semantic layer.
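
For the first option, a minimal sketch of the external table route is shown below, assuming a Synapse serverless SQL database and a hypothetical Parquet output from a finance domain; the data source name, path, table name and columns are all made up. The Analysis Services option isn’t sketched here because its DirectQuery model would be authored in Tabular/Power BI tooling rather than T-SQL.

```sql
-- Minimal sketch only: data source name, storage path, table name and
-- columns are hypothetical. Assumes the 'serving' schema from the earlier
-- sketch; authentication (e.g. a database scoped credential) is omitted.
CREATE EXTERNAL DATA SOURCE FinanceDomainLake
WITH (
    LOCATION = 'https://financedomainlake.dfs.core.windows.net/product'
);
GO

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);
GO

-- Abstracts the domain's Parquet output behind an external table, without
-- copying the data into the serving database.
CREATE EXTERNAL TABLE serving.Invoices
(
    InvoiceId   BIGINT,
    CustomerId  BIGINT,
    InvoiceDate DATE,
    AmountGBP   DECIMAL(18, 2)
)
WITH (
    LOCATION    = 'invoices/*.parquet',
    DATA_SOURCE = FinanceDomainLake,
    FILE_FORMAT = ParquetFormat
);
GO
```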

If we stay within the Microsoft product offerings, today, neither resource feels like a good fit considering the characteristics set out in the bullet points above. Maybe we need to consider a third-party offering. Maybe a combination of resources is required, for example (a sketch of how items 1 and 3 could combine follows the list):

  1. For data products delivered using a Delta Lake as the underlying format, create a serving model using Synapse SQL Serverless endpoints.
  2. For data products delivered using SQL endpoints, create an Analysis Services model.
  3. For data products delivered using simple Parquet file outputs, create SQL Pool external tables to serve/combine sources.
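
As a rough sketch of how items 1 and 3 might meet in practice (paths, names and the join key are all hypothetical), a single serving view could blend a Delta source from one domain with a Parquet source from another, all within serverless SQL; item 2 would live in an Analysis Services model instead.

```sql
-- Minimal sketch only: storage paths, names and the join key are
-- hypothetical. One serving view blending a Delta source (item 1) with a
-- Parquet source (item 3) in Synapse serverless SQL.
CREATE VIEW serving.CustomerRevenue
AS
SELECT cust.CustomerId,
       cust.CustomerName,
       SUM(ord.OrderValueGBP) AS TotalRevenueGBP
FROM OPENROWSET(
        BULK 'https://crmdomainlake.dfs.core.windows.net/product/customers/',
        FORMAT = 'DELTA'
     ) AS cust
JOIN OPENROWSET(
        BULK 'https://salesdomainlake.dfs.core.windows.net/product/orders/*.parquet',
        FORMAT = 'PARQUET'
     ) AS ord
  ON ord.CustomerId = cust.CustomerId
GROUP BY cust.CustomerId, cust.CustomerName;
GO
```

Whether a view like this stays fast enough for “clicky, clicky, draggy, droppy” exploration is exactly the open performance question raised in the characteristics above.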

Or, could we do something else? As stated. Today. I’m not sure. Maybe in the future we will have a better technical answer. To be continued 😉


Many thanks for reading.

