Addressing Hype, Problems Faced & Common Misconceptions
As I continue my day job supporting customers on their journey towards a true data mesh architecture, I’ve started to see a series of common questions, or rather a set of common problems that I get asked about. Frequently asked questions (FAQs), if you like 🙂
For clarity, I am moving on from the initial questions, like ‘what is a data mesh?’ and ‘why do I need one?’, to better, more informed questions which will shape how we go about implementing a data mesh architecture.
It is these (next stage) frequent questions about Data Mesh that I have decided would make for a useful post in this blog series. They appear in no particular order and are paraphrased slightly to hopefully add maximum value for the community.
My Top 12 Data Mesh FAQs based on real world experiences so far.
- What technology should I use for my data product storage layer?
- Is every Data Mesh implementation the same?
- What frontend tool can be used to support the citizen data engineers (business users) in the mesh supervision plane?
- What technology can we use to deliver insight across the Data Mesh?
- Should all Data Mesh capabilities be de-centralised?
- When does a data platform implementation become a Data Mesh?
- How long will it take to implement a Data Mesh?
- What is the difference between Data Mesh and Data Fabric?
- Should a data product handle both transactional/operational and analytical data?
- What should a minimum viable Data Mesh contain?
- Is Data Mesh just hype or could we make it a reality?
- Should we be thinking about all data, as a product?
What technology should I use for my data product storage layer?
Answer: Data Mesh is about so much more than technology. If you are asking this question, there is probably a wider disconnect in the understanding of the goal behind delivering a data mesh architecture. To be simplistic with the answer: use a Data Lake for storage. But what is the use case you are trying to solve for? A better answer might be to use a Data Lake, but set up with the Delta Lake open-source format. However, this still needs to be use case driven, based on the requirements of the data product. It’s more important that we have common interfaces for the data products, and interfaces across the mesh that make the data accessible, regardless of the underlying product storage.
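To make that last point concrete, here is a minimal Python sketch of what ‘common interfaces regardless of storage’ could look like. This is purely illustrative – the class and method names (`DataProductPort`, `read`, `schema`) are hypothetical, not taken from any specific product or standard:

```python
from abc import ABC, abstractmethod

# Hypothetical sketch: every data product exposes the same output port,
# regardless of how it stores data internally (Delta Lake, warehouse, etc.).
class DataProductPort(ABC):
    @abstractmethod
    def schema(self) -> dict:
        """Describe the shape of the data served by this product."""

    @abstractmethod
    def read(self) -> list:
        """Serve records through the common interface."""

# One product might sit on a Delta Lake table...
class LakeBackedProduct(DataProductPort):
    def __init__(self, rows: list):
        self._rows = rows  # stand-in for a Delta table read

    def schema(self) -> dict:
        return {"customer_id": "string", "spend": "double"}

    def read(self) -> list:
        return self._rows

# ...another on a relational warehouse; consumers cannot tell the difference.
class WarehouseBackedProduct(DataProductPort):
    def __init__(self, rows: list):
        self._rows = rows  # stand-in for a SQL query result

    def schema(self) -> dict:
        return {"order_id": "string", "total": "double"}

    def read(self) -> list:
        return self._rows

def consume(product: DataProductPort) -> int:
    # A mesh consumer depends only on the port, never the storage engine.
    return len(product.read())

lake = LakeBackedProduct([{"customer_id": "c1", "spend": 9.99}])
warehouse = WarehouseBackedProduct([{"order_id": "o1", "total": 20.0}])
print(consume(lake), consume(warehouse))
```

The design point is that `consume` works identically against both products; the storage decision stays private to each product team.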
Is every Data Mesh implementation the same?
Answer: No. Certainly not. Why? Because every organisation is not the same. Data Mesh is about people, process, and technology. Therefore, how could two Data Mesh implementations be the same? There might be similarities within a given industry vertical, e.g. common healthcare domains and outputs. But beyond that, there is no ‘cookie cutter’ answer to delivering a Data Mesh. I would also go as far as strongly challenging those that think otherwise.
What frontend tool can be used to support the citizen data engineers (business users) in the mesh supervision plane?
Answer: This is the hardest problem we currently need to solve. We firstly need to consider what cloud vendor technologies are available, then what other third-party offerings might be on the market to support us. The mesh supervision plane is a goal that could/should include many capabilities. In an ideal world, with enough time and resources, I would like to build my own as a browser-based ‘software as a service’ offering – something that includes a marketplace of assets, covering analytics self-service as well as infrastructure self-service.
What technology can we use to deliver insight across the Data Mesh?
Answer: Building on the previous question slightly here, but with a focus on data insight and, implicitly, data analytics. To start with, I like the old ways of serving data through a semantic layer (presentation layer), meaning the business user doesn’t need to worry about entity relationships or metric calculations – everything is available in a nice drag-and-drop form. That said, when we scale out across a decentralised data mesh, the traditional semantic layer technologies are going to struggle. Data virtualisation seems to be the latest and best answer to this problem that I can currently offer. I have explored several tools and would currently recommend either Pyramid or Denodo. Or, for a smaller set of data products/domains, something built in house using Power BI Premium common data models.
Should all Data Mesh capabilities be de-centralised?
Answer: No. We want data products and domains to become scalable to increase business velocity. But those scalable elements of our architecture are still going to require a core set of centralised/foundation services. Including things like identity management.
When does a data platform implementation become a Data Mesh?
Answer: Can we really treat this as an evolution from data platform to Data Mesh? I don’t think so. A much bigger set of re-organisation/refactoring is often required if you have a large monolithic data platform. Again, this is not just a technical solution. Maybe an existing data platform solution could be refactored into a fledgling data product, alongside a set of newly defined governance and processes. Or, another approach could be to start implementing a new set of data sharing interfaces over an existing data platform to support its inclusion in a wider plane of interaction for data serving.
How long will it take to implement a Data Mesh?
Answer: I’ve got to play the consultant’s ‘it depends’ card on this one. How long is a piece of string? It often comes down to the level of maturity in the existing technology estate. There are many factors and variables that can be used to inform an answer, but it can take an army of business analysts to complete that assessment before we even start the data mesh journey. The actual answer is very likely going to be in years, not months. To offer some perspective, Microsoft have created an internal data mesh that took around 3.5 years to build/organise, covering the three pillars of people, process, and technology.
What is the difference between Data Mesh and Data Fabric?
Answer: This could turn into an exceptionally long answer, and one that almost needs a separate blog post, as there is too much knowledge and information to try and distil here. However, to be overly simplistic (no trolls please): data fabric is about a common method for data integration – a framework, if you like – whereas data mesh is about a decentralised architecture. The two high level concepts are NOT mutually exclusive. Asking the difference between them is probably a clue that the requirements aren’t fully understood. Or that people are just playing buzzword bingo!
Should a data product handle both transactional/operational and analytical data?
Answer: In my opinion, yes. Pipelines are known to be the enemy, or an anti-pattern, for data mesh implementations. Therefore, I see both OLTP and OLAP workloads being delivered as part of a given data product. A purely analytical data product will need to ingest data from somewhere, therefore needing a pipeline technology to support the data load – acknowledging the obvious contradiction in this. Of course, we don’t live in a perfect world of ideals. But I think we should aim for both data types “living together”, or at least in a highly cohesive, loosely coupled ecosystem (computer science theory implied).
To consider a vendor offering as an answer to this: Microsoft’s Cosmos DB with Synapse Link enabled into Synapse Analytics. It’s not perfect, but it would support the requirement for frictionless data ingestion in a single data product – HTAP, as Microsoft like to call it.
What should a minimum viable Data Mesh contain?
Answer: I have thought about this question a lot, accepting that the bar to entry for data mesh is high. But how high? How do we quantify or describe a minimum viable mesh? To be brave and go first, as I often do, I will say we need the following:
- Two functioning data products with common interfaces and outward technical consistency.
- A core of foundational services supporting user authentication and data exploration across the (fledgling) data products.
- A runbook for data product onboarding into the mesh. By any technical means aligned to current operational standards.
- Practices for metadata handling and data governance. Including sensitive data if applicable.
- A wiki or similar set of living documentation describing the data mesh to users in both technical and non-technical terms, allowing for interaction with and ultimately the use of the mesh platform.
I am going to force myself to stop at a distilled set of five bullet points to offer an element of pragmatism here, in keeping with the caveat – a minimum viable data mesh.
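The onboarding runbook and metadata bullets above could be captured as a simple data product descriptor that gets validated before a product joins the mesh. The sketch below is a hypothetical illustration – the field names and checks are my own assumptions, not a published standard:

```python
from dataclasses import dataclass

# Hypothetical sketch of the metadata a minimum viable mesh might require
# before onboarding a data product (field names are illustrative only).
@dataclass
class DataProductDescriptor:
    name: str
    domain: str
    owner: str                     # accountable product owner
    output_port: str               # e.g. "delta", "sql", "api"
    contains_sensitive_data: bool  # drives governance handling
    documentation_url: str         # link to the living wiki page

    def onboarding_errors(self) -> list:
        """Checks mirroring the minimum viable mesh bullets above."""
        errors = []
        if not self.owner:
            errors.append("a data product must have a named owner")
        if not self.documentation_url:
            errors.append("living documentation is part of the minimum bar")
        return errors

candidate = DataProductDescriptor(
    name="customer-360",
    domain="sales",
    owner="jane.doe",
    output_port="delta",
    contains_sensitive_data=True,
    documentation_url="https://wiki.example.com/customer-360",
)
print(candidate.onboarding_errors())  # an empty list means ready to onboard
```

Even a lightweight descriptor like this makes the runbook enforceable rather than aspirational, because the mesh supervision plane can reject products that fail the checks.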
Is Data Mesh just hype or could we make it a reality?
Answer: The hype around data mesh is now borderline negative marketing. Data Mesh is about a north star and a vision of what could be done – one that I like. That said, the bar to entry is so high that some organisations may qualify out before even trying to implement a data mesh. Or they simply can’t invest in enough of the frontloaded elements required for a true data mesh to make it a viable business case.
Should we be thinking about all data, as a product?
Answer: Yes, but only if that definition of a data product is frontloaded and agreed upon across the organisation. Additionally, we need to be careful to align terminology: what might already be defined as a dataset vs a data table, or even a database. Working definitions are very important to have, and creating them collectively will help mature the data culture for the wider business when moving towards a data mesh architecture.
I hope you found this slightly dry blog post useful. It’s this thinking that I’d like us to do together as a community, so I would welcome your feedback. What questions are you facing when implementing a data mesh?
Many thanks for reading.