Hi all, I get asked this question a lot. So I decided to do a full post on the subject based on my current thoughts and experience.
When should I use multiple Azure Data Factory’s?
Ok, let’s go deeper and explore why it depends!
Also, to be clear, we aren’t talking about having multiple Azure Data Factory’s for Dev/Test/Prod environments.
The obvious and easy reason for having multiple Data Factory’s could be that you simply want to separate your business processes. Maybe they all have separate data delivery requirements and it just makes management of data flows easier to handle. For example:
They could have different data delivery deadlines, they process on different schedules and don’t share any underlying connections.
You may also have multiple projects underway that mean you want to keep teams isolated.
The charging model for Data Factory is one of the most confusing things out of all Azure Resources. This is mainly because Data Factory can do so much and each part of the resource has a different charging metric. That said, calculating what a given orchestration process is going to cost is really tricky.
Microsoft’s Azure Price Calculator: https://azure.microsoft.com/en-gb/pricing/calculator
For some companies that want a really clean way of inter-department recharging having separate Azure Subscriptions is one way you might choose to handle billing. This means at an Azure Resource level each department could have a dedicated Azure Data Factory.
The flip side to that situation is that as a whole you may end up paying more for all Data Factory resources because you aren’t taking advantage of ‘economies of scale’. An example of this could be for things like Data Flow cluster compute or SSIS IR compute.
In a multiple national company regulatory restrictions may mean your data can’t leave country boundaries. Therefore having an Azure Data Factory dedicated to a particular Azure Region could be a factor.
You can overcome this in a single Data Factory using region-specific Azure Integration Runtimes, but that adds a management overhead to deployments of IR’s and aligning specific (copy) Activities to those dedicated IR’s. Therefore, separate factories might make more sense. With multiple Data Factory’s you can leave the default region ‘Auto Resolving’ IR in place without any configuration.
For those that aren’t aware, when performing data movement operations in Data Factory the compute is done at the destination (sink) location. For example, when copying data from Data Lake 1, located in East US to Data Lake 2, located in North Europe. You could say the data will be pulled by Data Factory to North Europe from East US.
Azure Regional data egress charges would also be incurred if you did this.
To clarify this point:
- Multiple region specific Data Factory’s without any configuration overhead.
- One Data Factory with region specific Integration Runtime’s, but with Activity configuration required.
This is a reason not to use multiple Data Factory’s. You might only want to configure/create a single set of Linked Services and have generic Datasets for your source systems. Therefore, keeping everything in one Data Factory means boiler plate code parts is better. Plus, creating generic utility pipelines for sending emails, might be simpler if only developed, tested and deployed once.
Also, Data Factory Template‘s maybe a consideration for common data handling patterns.
Security (PII Data)
The Azure Portal Resource Manager Access Control (IAM) options allow permissions to be assigned at Resource Group and Resource levels. Sometimes referred to as the Management Plane in Azure. However, for Data Factory that is where the granularity of security roles stop. Therefore, if you have access to Data Factory, you have access to everything Data Factory connects to via its Linked Services.
Azure Key Vault offers a layer of separation to free text credentials, but typically Data Factory’s MSI will be able to read all Key Vault secrets so I wouldn’t class this as a complete security layer.
In this situation, you might want to separate your Data Factory’s to control access to the underlying connections and operations. Maybe for the sake of PII data separation or another source system containing sensitive data.
Decoupling Wider Orchestration from Workers
If you’ve been following my Metadata Driven Processing Framework (ADFprocfwk) posts for Data Factory you’ll know that I talk about Worker pipelines. These Worker pipelines contain the Activities specific to the operation we want to perform. It could be a Copy, a Notebook call, etc.
In the context of my processing framework it is perfectly acceptable to separate the Worker pipelines from the boiler plate Parent/Child framework pipelines. This could be because you want the wider solution orchestration and management to be disconnected from the data level processing.
It could also be that you want the common code base to be separated from the specific code. I don’t have any strong arguments why you might do this. But the option is there. Maybe, your organisation is structured in such a way that global operations are handed centrally from the localised operating company’s. In which case the disconnect in Data Factory’s works better for global teams.
If your using ARM Templates to deploy Azure Data Factory you may feel like this whole sale approach isn’t great, and you’d be correct in thinking that. A massive ARM template with 100’s of parameters isn’t a nice thing to have to deal with. In this situation breaking up Data Factory’s could help. However, the better option is to not use the ARM Template’s from the ‘adf_publish’ branch altogether. Instead, use PowerShell for deploying only the Data Factory components you want. This means only the linked services/datasets/pipelines specific to a given release get deployed rather than using the ARM template approach. How? Take the JSON files from your default code branch (or feature branch) and for each file align it to the respective Set- PowerShell cmdlet as the definition file.
Generally this technique of deploying Data Factory parts is much nicer and more controlled than using the ARM Templates. However, is does mean you have to handle component dependencies and removals, if you have any.
- Break up the Data Factory if using ARM Templates to deploy.
- Use PowerShell to control component level releases and keep the single Data Factory.
As you’ve hopefully realised there are several considerations why you might want to use multiple Data Factory resources. These considerations are very circumstantial to your organisation, setup, development team, regulations etc. However, this post outlines what those considerations are given my experience of the service.
In all cases there is the obvious deployment and management overhead of having several orchestration resources that need to be co-ordinated.
Hope this helps.
Many thanks for reading.