While I completely share and actually like Microsoft’s vision of an analytics resource…
“that brings together data integration, enterprise data warehousing and big data analytics”
… the marketing, hype and technical implementation have resulted in a lot of confusion and disappointment.
So, to answer the title of this blog post directly: my opinion, as I write this on 29th January 2021, is no, Azure Synapse Analytics is not ready. Sorry Microsoft, but you’ve had long enough. I can no longer hold back the questions and demands from customers asking why Synapse still isn’t included in my architecture diagrams.
Now, I’m not here to make such negative statements without some follow-up and reasoning. Let’s deal with why…
Why Isn’t Synapse Ready?
I’m coming at this from my perspective as an architect considering the integrated experience, or the single pane of glass as it’s been described. Let’s break down the Synapse technical offerings under the hood, specifically picking on the parts where we are still struggling to get a production-ready resource. In no particular order.
Apache Spark Compute
As Microsoft have delivered their own implementation of Spark within Synapse, the offering puts itself in the firing line for comparisons with other Spark implementations delivered by Databricks and even HDInsight.
Straight away, we see that Synapse currently only offers Spark 2.4 (screenshot from the Synapse Workspace seen on the right). We don’t know when Microsoft will support Spark 3.0.
Other comparisons between Databricks and Synapse Spark Compute Pools:
- Spark SQL MERGE statements are not supported in Synapse. Only the Python (Delta API) syntax is available.
- Table OPTIMIZE functionality (small-file compaction) isn’t available in Synapse, which is a big performance disadvantage when hitting the underlying Parquet files.
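To make the MERGE point concrete for readers who haven’t used it: MERGE is an upsert, updating rows that match on a key and inserting the rest, which in Databricks is a single SQL statement. Below is a toy sketch of those semantics in plain Python (no Spark, hypothetical data), just to show what you lose when only the programmatic API is available.

```python
# Toy illustration of MERGE (upsert) semantics using plain Python
# dictionaries instead of Delta tables. In Databricks the same logic
# would be a single Spark SQL statement, roughly:
#   MERGE INTO target USING updates ON target.id = updates.id
#   WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *

def merge_upsert(target, updates, key="id"):
    """Update matching rows, insert new ones; returns a new table."""
    merged = {row[key]: row for row in target}   # index target rows by key
    for row in updates:
        merged[row[key]] = row                   # matched -> update, unmatched -> insert
    return sorted(merged.values(), key=lambda r: r[key])

target = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
updates = [{"id": 2, "value": "B"}, {"id": 3, "value": "c"}]
print(merge_upsert(target, updates))
# id 2 is updated, id 3 is inserted, id 1 is untouched
```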
I’ve even heard a rumour from a reliable source that workloads on HDInsight Spark run better than on the Synapse Spark implementation. Ouch! I haven’t followed up on this claim and tested it myself, so please apply the ‘rumour’ caveat to this statement. Maybe a future blog post.
Orchestration – Integration Pipelines
If you follow my blog, you’ll know I’m a big fan of Azure Data Factory (ADF), so when Synapse revealed the common code base for its orchestration pipelines, including the user interface, I was pretty happy. That said, we don’t yet have feature parity between ADF and Synapse.
- We can’t lift and shift SSIS packages into Synapse Integrate Pipelines as part of a legacy on-premises-to-Azure migration project. The SSIS IR compute and pipeline activities simply don’t exist in Synapse. This isn’t a big issue, but it’s one more reason not to move straight to Synapse, and it means a middle-step architecture is required in potential cloud migration projects.
- We don’t have Linked Self-Hosted IR support available. For me, this is a major deal breaker. Deploying separate IRs can require a lot of effort. We need IRs that can be shared, and ideally shared from ADF as well.
- Coming at this from the other direction: Synapse has new pipeline activities for executing Notebooks against the Apache Spark compute pools, and we don’t have those activities available in Data Factory if we wanted to start considering a migration path to Synapse. I’m thinking about a roadmap where we could start doing data transformations in Synapse while still using existing ADF pipelines, especially considering the Linked IR limitations mentioned above. Unfortunately not; it has to be a solution that is all Synapse, or nothing.
SQL On-Demand (Serverless)
The ‘serverless’ on-demand compute within Synapse is really great, especially considering the shared Hive Metastore used to surface the external objects. However, the major limiting factor is that it doesn’t support Delta tables; only Hive tables created from Parquet files can be queried.
Once the serverless compute supports Delta tables, this will become a genuinely usable option.
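Why is Delta support more than just another file format checkbox? A Delta table is Parquet files plus a transaction log (`_delta_log`) of JSON commits that record which files are currently part of the table. An engine that only understands Parquet and simply globs the directory would also read files a later commit removed or compacted away. The sketch below illustrates that with a deliberately simplified log format (real Delta log entries carry far more metadata than shown here).

```python
# Highly simplified sketch of the Delta Lake transaction log idea:
# each commit is a list of add/remove actions naming Parquet files.
# Replaying the log tells you which files are "live". A Parquet-only
# engine that ignores the log would read all three files below,
# double-counting the data that was compacted into part-2.

def live_files(commits):
    """Replay simplified add/remove actions to find the active files."""
    files = set()
    for commit in commits:
        for action in commit:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)

# Two commits: the second compacts part-0 and part-1 into part-2.
log = [
    [{"add": {"path": "part-0.parquet"}}, {"add": {"path": "part-1.parquet"}}],
    [{"remove": {"path": "part-0.parquet"}},
     {"remove": {"path": "part-1.parquet"}},
     {"add": {"path": "part-2.parquet"}}],
]
print(live_files(log))  # only part-2 is live; a naive glob sees all three
```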
Development &amp; Deployment
DevOps is a cornerstone of any solution; having resilient development and deployment practices is critical. If we ignore the fact that Microsoft only introduced Git source control support for Synapse last month, what else should we consider here:
- The .NET NuGet packages are still in preview or beta. Although this might be supported by Microsoft, I’ve yet to find an enterprise customer that is happy deploying a solution to production where elements are still only in a beta state. Complete namespace details here: https://docs.microsoft.com/en-us/dotnet/api/azure.analytics.synapse.artifacts
- We don’t have a complete set of Terraform resources available yet; only the basic workspace, firewall and compute components are covered. Details here: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/synapse_workspace
- Deployments using ARM Templates can only be done via the Synapse Workspace publish options, similar to Azure Data Factory, with an underlying publish code branch. This means we have to deploy the entire Synapse resource, including all parts, wholesale. No granular control over certain artifacts is available. It seems we need to develop our own approach again using the relevant PowerShell cmdlets: https://docs.microsoft.com/en-us/powershell/module/az.synapse. Like most people have done for Data Factory 😉
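A roll-your-own granular deployment typically means calling the Synapse artifacts endpoint (or the Az.Synapse cmdlets) one artifact at a time. The sketch below only builds the request rather than sending it, and the endpoint shape, api-version and payload are my assumptions from the public docs, so verify them before use; authentication (an AAD bearer token) is deliberately omitted.

```python
import json

# Hypothetical sketch of a granular, per-artifact deployment helper
# against the Synapse workspace dev endpoint, instead of the wholesale
# ARM publish. The URL pattern, api-version and payload shape are
# assumptions taken from public documentation - check the current
# Microsoft docs before relying on them. No auth token is included.

def build_pipeline_deploy_request(workspace, pipeline_name, definition):
    """Return the (method, url, body) for deploying a single pipeline."""
    url = (f"https://{workspace}.dev.azuresynapse.net"
           f"/pipelines/{pipeline_name}?api-version=2020-12-01")
    body = json.dumps({"properties": definition})
    return "PUT", url, body

method, url, body = build_pipeline_deploy_request(
    "myworkspace", "LoadSales", {"activities": []})
print(method, url)
```

Deploying artifact by artifact like this is what gives you the granular control the publish-branch approach lacks.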
Operational Monitoring
Considering everything that Azure Synapse Analytics can do as an integrated resource, the operational monitoring outputs (seen on the right via the Diagnostic settings panel in the Azure Portal) are still greatly lacking in the meaningful metrics that a service management team would expect.
There are a few more options for logs if you view the Spark Pools and SQL Pools separately, but nothing about pipelines, and nowhere near enough auditing content for a production environment.
SQL Pools (SQLDW)
The renamed Azure SQL Data Warehouse (SQLDW) is without a doubt an excellent, very mature offering. Its optimized storage, if you follow the marketing, is industry leading, given the benchmarking work done by Microsoft with the Polaris query engine; whitepaper here: https://aka.ms/synapse-dqp.
However, is this enough of a reason alone to use Synapse? No.
Given the Delta Lake capabilities available in Databricks it becomes really difficult to justify copying all our data out of the Data Lake and redistributing it in a SQLDW table. Copying/moving “big data” is the most expensive and time-consuming part of any platform so leaving the data in the Data Lake makes far more sense, especially when you can partition and aggregate the Data Lake content as Delta Table entities. Furthermore, the Data Lake doesn’t suffer from the same concurrency limitations as the SQLDW compute.
I could also make arguments here about cost when paying for storage twice (Data Lake, SQLDW) and scaled out compute clusters. Leave the data where it is, pay for one cluster. You get the idea.
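The double-storage argument is easy to put into back-of-an-envelope numbers. All prices below are entirely hypothetical placeholders, not Azure list prices; plug in real figures for your region before drawing any conclusions.

```python
# Back-of-an-envelope illustration of the "paying for storage twice"
# argument: keeping data only in the Data Lake versus duplicating it
# into SQLDW managed storage. All $/TB prices are made up for the
# sake of the arithmetic - substitute real regional pricing.

def monthly_storage_cost(tb, price_per_tb):
    return tb * price_per_tb

data_tb = 100
lake_price = 20.0    # hypothetical $/TB/month, Data Lake storage
sqldw_price = 120.0  # hypothetical $/TB/month, SQLDW managed storage

lake_only = monthly_storage_cost(data_tb, lake_price)
lake_plus_sqldw = lake_only + monthly_storage_cost(data_tb, sqldw_price)
print(lake_only, lake_plus_sqldw)  # duplicating the data dominates the bill
```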
Conclusion
Sadly, despite there being a lot of really great features within Synapse, including a lot of technical maturity in areas like the SQL engine and pipeline capabilities, when designing a solution it’s the weakest links that we have to focus on. As the proverb goes: a chain is only as strong as its weakest link.
Synapse brings all these capabilities together, so it has to be judged as a production resource together, as a whole.
I hope you found this post useful and not too negative.
Many thanks for reading.