Best Practices for Implementing Azure Data Factory

My colleagues and friends from the community keep asking me the same thing… What are the best practices for using Azure Data Factory? With any emerging, rapidly changing technology I’m always hesitant about the answer. However, after 4 years of working with ADF I think it’s time to start suggesting what I’d expect to see in any good Data Factory, one that is running in production as part of a wider data platform solution. We can call these technical standards or best practices if you like. Either way, I would phrase the question as: what makes a good Data Factory implementation?

The following are my suggested answers to this and what I think makes a good Data Factory. Hopefully together we can mature these things into a common set of best practices or industry standards for the tech. Some of the statements below might seem obvious and fairly simple, but maybe not to everybody, especially if you’re new to ADF. As an overview, I’m going to cover the following points:

  • Environment Setup & Developer Debugging
  • Naming Conventions
  • Pipeline Hierarchies
  • Pipeline & Activity Descriptions
  • Factory Component Folders
  • Linked Service Security via Azure Key Vault
  • Generic Datasets
  • Metadata Driven Processing
  • Parallel Execution
  • Hosted Integration Runtimes
  • Azure Integration Runtimes
  • Wider Platform Orchestration
  • Custom Error Handler Paths
  • Monitoring via Log Analytics
  • Using Templates
  • Documentation

Let’s start, my ADF best practices:


Environment Setup & Developer Debugging

Have a clean separation of resources for development, testing and production. This is obvious for any solution, but when applying it to ADF I’d expect to see the development service connected to source control as a minimum. Whether you use Azure DevOps or GitHub doesn’t matter, although authentication against Azure DevOps is slightly simpler. That development service should then be used with multiple code repository branches that align to backlog features. Finally, I’d expect developers working within those code branches to be using the ADF debug feature to perform functional testing of pipelines, adding breakpoints on activities as required. Pull requests for feature branches would be peer reviewed before being merged into the main delivery branch and published to the development Data Factory service. Understanding that separation of debug and development is important for that first Data Factory service, and it is even more important to get it connected to a source code system.

For clarification, other downstream environments (test, UAT, production) do not need to be connected to source control.

Naming Conventions

Hopefully we all understand the purpose of good naming conventions for any resource. However, when applied to Data Factory I believe this is even more important given the expected umbrella service status ADF normally has within a wider solution. Firstly, we need to be aware of the rules enforced by Microsoft for different components, here:

https://docs.microsoft.com/en-us/azure/data-factory/naming-rules

Unfortunately there are some inconsistencies to be aware of between components regarding which characters can and can’t be used. Once considered, we can label things as we see fit, ideally without being too cryptic and while still maintaining a degree of human readability.

For example, take a linked service to an Azure Functions App: we already know from the icon and the linked service type what resource is being called, so we can omit that part from the name. As a general rule I think understanding why we have something is a better basis for naming it than what it is. The ‘what’ can usually be inferred from context.

Finally, when considering component names, be mindful that when injecting expressions into things, some parts of Data Factory don’t like spaces or other characters in names that could later break the JSON expression syntax.
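To illustrate the point (the activity and property names here are purely hypothetical), dynamic content regularly references other components by name inside the expression string, so a clean, space-free name keeps these references simple:

```json
{
  "value": "@activity('Lookup Metadata').output.firstRow.FilePath",
  "type": "Expression"
}
```

If the referenced activity had a long, punctuated name, every expression touching it would have to carry that baggage too.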

Pipeline Hierarchies

I’ve blogged about the adoption of pipeline hierarchies as a pattern before (here) so I won’t go into too much detail again, other than to say it’s a great technique for structuring any Data Factory, having used it on several different projects…

  • Grandparent
  • Parent
  • Child
  • Infant

Pipeline & Activity Descriptions

Every pipeline and activity within Data Factory has a non-mandatory description field. I want to encourage all of us to start making better use of it. When writing any other code we typically add comments to offer others our understanding; I want to see these description fields used in ADF in the same way. A good naming convention gets us partly there, so now let’s enrich our Data Factories with descriptions too, again explaining why and how we did something. In a later post I’m planning to parse the JSON from a given Data Factory ARM template, extract the description values and make the service self-documenting.
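To show where these fields live in the underlying JSON (the pipeline and activity here are just a made-up example using a simple Wait activity), both the pipeline and each activity carry their own description property:

```json
{
  "name": "Wait For Upstream Extract",
  "properties": {
    "description": "Why: holds downstream processing back until the source system's nightly extract window has finished.",
    "activities": [
      {
        "name": "Wait 30 Seconds",
        "type": "Wait",
        "description": "Why: gives the file share time to settle before any copy activities start.",
        "typeProperties": {
          "waitTimeInSeconds": 30
        }
      }
    ]
  }
}
```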

Factory Component Folders

Folders and sub-folders are such a great way to organise our Data Factory components; we should all be using them to ease navigation. Be warned though, these folders are only used when working in the Data Factory portal UI. They are not reflected in the structure of our source code repo.

Adding components to folders is a very simple drag and drop exercise or can be done in bulk if you want to attack the underlying JSON directly. Subfolders get applied using a forward slash, just like other file paths.
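For example (the pipeline and folder names below are invented), the folder assignment is just a property on the component’s JSON, with the forward slash creating the sub-folder:

```json
{
  "name": "Ingest Sales Orders",
  "properties": {
    "folder": {
      "name": "Sales/Ingestion"
    },
    "activities": []
  }
}
```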

Also be warned: if developers working in separate code branches move things affecting or changing the same folders, you’ll get conflicts in the code, just like with other resources. In some cases I’ve also seen duplicate folders created where the removal of a folder couldn’t naturally happen in a pull request. That said, I recommend organising your folders early on in the setup of your Data Factory. Typically for customers I would name folders according to the business processes they relate to.

Linked Service Security via Azure Key Vault

Azure Key Vault is now a core component of any solution; it should be in place holding the credentials for all our service interactions. In the case of Data Factory, most linked service connections support obtaining values from Key Vault. Wherever possible we should be adding this extra layer of security and allowing only Data Factory to retrieve secrets from Key Vault using its own Managed Identity.

If you aren’t familiar with this approach, check out this Microsoft Docs page:

https://docs.microsoft.com/en-us/azure/data-factory/store-credentials-in-key-vault
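As a rough sketch of what this looks like in a linked service definition (the linked service, Key Vault reference and secret names below are all placeholders), the connection string is swapped for a Key Vault secret reference:

```json
{
  "name": "LS_AzureSqlDb_Metadata",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "LS_AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "MetadataDbConnectionString"
      }
    }
  }
}
```

This of course assumes the Data Factory Managed Identity has been granted access to read secrets from the Key Vault in question.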

Be aware that when working with custom activities in ADF, using Key Vault is essential because the Azure Batch application can’t inherit credentials from the Data Factory linked service references.

Generic Datasets

Where the design allows it I always try to simplify the number of datasets listed in a Data Factory. In version 1 of the product a hard-coded dataset was required as the input and output for every stage of our processing. Thankfully those days are in the past. Now we can use a completely metadata driven dataset for dealing with a particular type of object against a linked service. For example, one dataset for all CSV files from Blob Storage and one dataset for all SQLDB tables.

At runtime the dynamic content underneath the datasets is resolved in full, so monitoring is not impacted by making datasets generic. If anything, debugging becomes easier because of the common, reusable code.

Where generic datasets are used I’d expect the following values to be passed as parameters, typically from the pipeline, or resolved at runtime within the pipeline (see the dataset sketch after this list).

  • Location – the file path, table location or storage container.
  • Name – the file or table name.
  • Structure – the attributes available, provided as an array at runtime.
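A minimal sketch of what such a dataset might look like for delimited text files in Blob Storage, covering the location and name parameters only (every name here is illustrative, and the structure/schema parameter is left out for brevity):

```json
{
  "name": "DS_Generic_Csv_Blob",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "LS_BlobStorage",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "Location": { "type": "String" },
      "Name": { "type": "String" }
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": { "value": "@dataset().Location", "type": "Expression" },
        "fileName": { "value": "@dataset().Name", "type": "Expression" }
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```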

To be clear, I wouldn’t go as far as making the linked services dynamic, unless we were really confident in our controls and credential handling.

Metadata Driven Processing

Building on our understanding of generic datasets, a good Data Factory should include (where possible) generic pipelines; these are driven from metadata to simplify (as a minimum) data ingestion operations. Typically I use an Azure SQLDB to house my metadata, with stored procedures that get called via Lookup activities to return everything a pipeline needs to know.

This metadata driven approach means deployments to Data Factory for new data sources are greatly reduced; only adding new values to a database table is required. The pipeline itself doesn’t need to be complicated. Copying CSV files from a local file server to Data Lake Storage could be done with just three activities.
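To sketch just the metadata lookup part (the dataset, activity and stored procedure names are assumptions for illustration), a Lookup activity can call a stored procedure and return the full workload for the pipeline to iterate over:

```json
{
  "name": "Get Ingestion Metadata",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderStoredProcedureName": "[dbo].[GetIngestionWorkload]"
    },
    "dataset": {
      "referenceName": "DS_MetadataDb",
      "type": "DatasetReference"
    },
    "firstRowOnly": false
  }
}
```

Setting firstRowOnly to false means the activity returns every row from the metadata table, ready to feed a ForEach activity.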

Parallel Execution

Given the scalability of the Azure platform we should utilise that capability wherever possible. When working with Data Factory, the ‘ForEach’ activity is a really simple way to achieve parallel execution of its inner activities. By default, the ForEach activity does not run sequentially; it will spawn 20 parallel threads and start them all at once. Great! It also supports a maximum batch count of 50 threads if you want to scale things out even further. I recommend taking advantage of this behaviour and wrapping pipelines in ForEach activities wherever possible.

In case you weren’t aware, within the ForEach activity you need to use the syntax @{item().SomeArrayValue} to access values from the array used to initialise the parent activity.
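Putting both points together in a rough sketch (the source activity, child pipeline and property names are all made up), a ForEach running its default 20 threads and passing an item value to a child pipeline might look like this:

```json
{
  "name": "For Each Source Object",
  "type": "ForEach",
  "typeProperties": {
    "isSequential": false,
    "batchCount": 20,
    "items": {
      "value": "@activity('Get Ingestion Metadata').output.value",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "Execute Child Copy",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": {
            "referenceName": "PL_Child_Copy",
            "type": "PipelineReference"
          },
          "waitOnCompletion": true,
          "parameters": {
            "SourceName": "@{item().SomeArrayValue}"
          }
        }
      }
    ]
  }
}
```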

Hosted Integration Runtimes

Currently, if we want Data Factory to access our on-premises resources we need to use the Hosted Integration Runtime (previously called the Data Management Gateway). When doing so, I suggest the following two things be taken into account as good practice:

  1. Add multiple nodes to the hosted IR connection to offer automatic failover and load balancing of uploads. Also, make sure you throttle the concurrency limits of your secondary nodes if the VMs don’t have the same resources as the primary node. More details here: https://docs.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime
  2. When using ExpressRoute, make sure the VMs running the service are on the correct side of the network boundary. If you upgrade to ExpressRoute later in the project and the Hosted IRs have been installed on local Windows boxes, they will probably need to be moved. Consider this in your future architecture and upgrade plans.

Azure Integration Runtimes

When turning our attention to the Azure flavour of the Integration Runtime, I typically like to update this by removing its freedom to auto-resolve to any Azure region. There can be many reasons for this, regulatory requirements being one. But as a starting point, I simply don’t trust it not to charge me data egress costs when I already know which region the data is being stored in.

Also, given the new Data Flow features of Data Factory, we need to consider updating the cluster sizes that are set, and maybe having multiple Azure IRs for different Data Flow workloads.

In both cases these options can easily be changed via the portal and a nice description added. Please make sure you tweak these things before deploying to production and align Data Flows to the correct clusters in the pipeline activities.
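As an indicative sketch only (the region, core count and time-to-live values below are arbitrary examples, not recommendations), an Azure IR pinned to a region with explicit Data Flow compute settings looks something like this:

```json
{
  "name": "AzureIR-NorthEurope-DataFlows",
  "properties": {
    "type": "Managed",
    "description": "Region-pinned Azure IR sized for medium Data Flow workloads.",
    "typeProperties": {
      "computeProperties": {
        "location": "North Europe",
        "dataFlowProperties": {
          "computeType": "General",
          "coreCount": 8,
          "timeToLive": 10
        }
      }
    }
  }
}
```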

Wider Platform Orchestration

In Azure we need to design for cost. I never pay my own Azure subscription bills, but even so, we should all feel accountable for wasting money. To that end, pipelines should be created with activities to control the scaling of our wider solution resources.

  • For a SQLDB, scale it up before processing and scale it down once finished.
  • For a SQLDW, resume/start the cluster before processing, and maybe scale it out too.
  • For Analysis Services, resume the service to process the models and pause it afterwards. Maybe even have a dedicated pipeline that pauses the service outside of office hours (a sketch of this call is shown at the end of this section).
  • For Databricks, create a linked service that uses job clusters.

You get the idea.

Building pipelines that don’t waste money in Azure Consumption costs is a practice that I want to make standard. I go into greater detail on the SQLDB example in a previous blog post, here.
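As a purely illustrative sketch of the Analysis Services case mentioned above (the subscription, resource group and server segments of the URL are placeholders, and it assumes the factory’s Managed Identity has been granted rights on the Analysis Services resource), a Web activity can call the Azure management REST API to pause the service. Note the Web activity expects a body for POST calls, so a small placeholder is included:

```json
{
  "name": "Pause Analysis Services",
  "type": "WebActivity",
  "typeProperties": {
    "method": "POST",
    "url": "https://management.azure.com/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.AnalysisServices/servers/<server-name>/suspend?api-version=2017-08-01",
    "body": {
      "note": "placeholder body"
    },
    "authentication": {
      "type": "MSI",
      "resource": "https://management.azure.com/"
    }
  }
}
```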

Custom Error Handler Paths

Our Data Factory pipelines, just like our SSIS packages, deserve some custom logging and error paths that give operational teams the detail needed to fix failures. For me, these boilerplate handlers should be wrapped up as ‘Infant’ pipelines and accept a simple set of details (a sketch of calling such a handler appears a little further down):

  • Calling pipeline name.
  • Run ID of the failed execution.
  • Custom details coded into the process.

Everything else can be inferred or resolved by the error handler.

Once established, we need to ensure that the processing routes within the parent pipeline are connected correctly: OR, not AND. All too often I see error paths that never execute because the developer is expecting activity 1 AND activity 2 AND activity 3 to fail before the handler is called. Please don’t make this same mistake.
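Bringing both of these points together in a hedged sketch (the activity, pipeline and parameter names are invented), the handler is called via an Execute Pipeline activity that depends on a single upstream activity’s ‘Failed’ outcome and passes the calling pipeline name and run ID from system variables:

```json
{
  "name": "Run Error Handler",
  "type": "ExecutePipeline",
  "dependsOn": [
    {
      "activity": "Copy Source Data",
      "dependencyConditions": [ "Failed" ]
    }
  ],
  "typeProperties": {
    "pipeline": {
      "referenceName": "PL_Infant_ErrorHandler",
      "type": "PipelineReference"
    },
    "waitOnCompletion": true,
    "parameters": {
      "CallingPipelineName": "@{pipeline().Pipeline}",
      "CallingRunId": "@{pipeline().RunId}",
      "CustomDetails": "Copy of source data failed."
    }
  }
}
```

Because each failing activity gets its own handler dependency, the OR behaviour comes for free; no single handler is left waiting on every upstream activity to fail.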

Monitoring via Log Analytics

Like most Azure resources, we have the ability via ‘Diagnostic settings’ to output telemetry to Log Analytics. The out-of-the-box monitoring area within Data Factory is handy, but it doesn’t deal with any complexity. Having the metrics and logs going to Log Analytics as well is a must-have for all good factories. The experience is far richer and allows operational dashboards to be created for any or all Data Factories.
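To sketch what gets switched on (the setting name and workspace resource ID are placeholders; the log categories listed are the standard Data Factory ones), the diagnostic setting routes pipeline, activity and trigger runs to a Log Analytics workspace:

```json
{
  "name": "SendToLogAnalytics",
  "properties": {
    "workspaceId": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.OperationalInsights/workspaces/<workspace-name>",
    "logs": [
      { "category": "PipelineRuns", "enabled": true },
      { "category": "ActivityRuns", "enabled": true },
      { "category": "TriggerRuns", "enabled": true }
    ],
    "metrics": [
      { "category": "AllMetrics", "enabled": true }
    ]
  }
}
```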


Using Templates

Pipeline templates are, I think, a fairly underused feature within Data Factory. They can be really powerful when you need to reuse a set of activities that only has to be provided with new linked service details. And if nothing else, getting Data Factory to create SVGs of your pipelines is really handy for documentation too. I’ve even used templates in the past to snapshot pipelines when source code versioning wasn’t available. A total hack, but it worked well.

Like the other components in Data Factory, template files are stored as JSON within our code repository. Each template will have a manifest.json file that contains the vector graphic and details about the pipeline that has been captured. Give them a try, people.

Documentation

Lastly, for this blog post! Do I really need to call this out as a best practice? Every good Data Factory should be documented. As a minimum we need somewhere to capture the business process dependencies of our Data Factory pipelines. We can’t see this easily when looking at a portal of folders and triggers, trying to work out what goes first and what sits upstream of a new process. I use Visio a lot and it seems to be the perfect place to create (what I’m going to call) our Data Factory Pipeline Maps: maps of how all our orchestration hangs together. Furthermore, if we tried to create this view in Data Factory itself, the layout of the child pipelines can’t be saved, so it’s much easier to visualise in Visio. I recommend we all start creating something similar.


That’s all folks. If you have done all of the above when implementing Azure Data Factory then I salute you 🙂

Many thanks for reading. Comments and thoughts very welcome. Defining best practices is always hard and I’m more than happy to go first and get them wrong a few times while we figure it out together 🙂

 
