Best Practices for Implementing Azure Data Factory

Blog post updated 9th November 2020

My colleagues and friends from the community keep asking me the same thing… What are the best practices for using Azure Data Factory (ADF)? With any emerging, rapidly changing technology I’m always hesitant about the answer. However, after 5 years of working with ADF I think it’s time to start suggesting what I’d expect to see in any good Data Factory, one that is running in production as part of a wider data platform solution. We can call these technical standards or best practices if you like. In either case, I would phrase the question as: what makes a good Data Factory implementation?

The following are my suggested answers to this and what I think makes a good Data Factory. Hopefully together we can mature these things into a common set of best practices or industry standards when using the cloud resource. Some of the below statements might seem obvious and fairly simple, but maybe not to everybody, especially if you’re new to ADF. As an overview, I’m going to cover the following points:

  • Environment Setup & Developer Debugging (updated)
  • Multiple Data Factory Instances (new)
  • Deployments (updated)
  • Automated Testing
  • Naming Conventions
  • Pipeline Hierarchies
  • Pipeline & Activity Descriptions
  • Annotations (new)
  • Factory Component Folders
  • Linked Service Security via Azure Key Vault
  • Security Custom Roles (new)
  • Dynamic Linked Services
  • Generic Datasets
  • Metadata Driven Processing (updated)
  • Parallel Execution (updated)
  • Hosted Integration Runtimes (updated)
  • Azure Integration Runtimes
  • Wider Platform Orchestration
  • Custom Error Handler Paths
  • Monitoring via Log Analytics
  • Timeouts & Retry (new)
  • Service Limitations
  • Using Templates (updated)
  • Documentation
See also (new): As a side project I recently created a PowerShell script to inspect an ARM template export for a given Data Factory instance and provide a summarised checklist covering a partial set of the bullet points above; view the blog post here.

Let’s start, my set of Data Factory best practices:


Environment Setup & Developer Debugging

Have a clean separation of resources for development, testing and production. This is obvious for any solution, but when applying it to ADF I’d expect to see the development service connected to source control as a minimum. Whether you use Azure DevOps or GitHub doesn’t matter, although authentication against Azure DevOps is slightly simpler within the same tenant. Then, that development service should be used with multiple code repository branches that align to backlog features. Next, I’d expect developers working within those code branches to be using the ADF debug feature to perform basic end to end testing of newly created pipelines, using break points on activities as required. Pull requests of feature branches would be peer reviewed before merging into the main delivery branch and publishing to the development Data Factory service. Understanding the separation between debug runs and the published development service is important for that first Data Factory instance, and it is even more important to get it connected to a source control system.

For clarification, other downstream environments (test, UAT, production) do not need to be connected to source control.

Final thoughts on environment setup: another option and technique I’ve used in the past is to handle different environment setups within Data Factory itself via a Switch activity. In this situation a central variable controls which activity path is used at runtime. For example, having different Databricks clusters and Linked Services connected to different environment activities:

This is probably a special case and nesting activities via a ‘Switch’ does come with some drawbacks. This is not a best practice, but an alternative approach you might want to consider. I blogged about this in more detail here.

Multiple Data Factory Instances

Question: when should I use multiple Data Factory instances?

Answer(s):

  1. To separate business processes (sales, finance, HR).
  2. For Azure cost handling and consumption.
  3. Due to regional regulations.
  4. When handling boiler plate code.
  5. For a cleaner security model.
  6. Decoupling worker pipelines from wider bootstrap processes.

In all cases these answers aren’t hard rules, more a set of guidelines to consider. I’ve elaborated on each situation as I’ve encountered it in the following blog post here. Even if you do create multiple Data Factory instances, be careful: some resource limitations are handled at the subscription level.

Deployments

Leading on from our environment setup the next thing to call out is how we handle our Data Factory deployments. The obvious choice might be to use ARM templates. However, this isn’t what I’d recommend as an approach (sorry Microsoft). The ARM templates are fine for a complete deployment of everything in your Data Factory, maybe for the first time, but they don’t offer any granular control over specific components and by default will only expose Linked Service values as parameters.

My approach for deploying Data Factory would be to use PowerShell cmdlets and the JSON definition files found in your source code repository, supported by a config file listing the components you want to deploy. Generally this technique of deploying Data Factory parts, with a 1:1 mapping between PowerShell cmdlets and JSON files, offers much more control and options for dynamically changing any parts of the JSON at deployment time. But it does mean you have to manually handle component dependencies and removals, if you have any. A quick visual of the approach:

To elaborate, the PowerShell uses the artifacts created by Data Factory in the ‘normal’ repository code branches (not the adf_publish branch). Then, for each component in a configurable list, it passes the JSON file as the definition to the respective PowerShell cmdlet. The cmdlets use the DefinitionFile parameter to set exactly what you want in your Data Factory, given what was created by the repo-connected instance.

Pseudo PowerShell and JSON example below, building on the visual representation above.
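To make that a little more concrete, here is a minimal sketch of the pattern using the Az.DataFactory cmdlets. The resource group, factory name and repository folder layout below are illustrative assumptions only; in practice the component names would be driven from your config file rather than hard coded.

# Illustrative values only; swap in your own environment details.
$resourceGroup = "rg-dataplatform-dev"
$dataFactory   = "adf-dataplatform-dev"
$repoRoot      = "C:\Git\MyDataFactory"

# Deploy a linked service, a dataset and a pipeline from their JSON definition files.
Set-AzDataFactoryV2LinkedService -ResourceGroupName $resourceGroup -DataFactoryName $dataFactory `
    -Name "LS_AzureSqlDb" -DefinitionFile "$repoRoot\linkedService\LS_AzureSqlDb.json" -Force

Set-AzDataFactoryV2Dataset -ResourceGroupName $resourceGroup -DataFactoryName $dataFactory `
    -Name "DS_GenericCsv" -DefinitionFile "$repoRoot\dataset\DS_GenericCsv.json" -Force

Set-AzDataFactoryV2Pipeline -ResourceGroupName $resourceGroup -DataFactoryName $dataFactory `
    -Name "Pipeline1" -DefinitionFile "$repoRoot\pipeline\Pipeline1.json" -Force

In a real deployment script each of these calls would sit inside a loop over the configured component list for that type.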

If you think you’d like to use this approach, but don’t want to write all the PowerShell yourself, great news, my friend and colleague Kamil Nowinski has done it for you in the form of a PowerShell module (azure.datafactory.tools). Check out his GitHub repository here. This also can now handle dependencies.

I’ve attempted to summarise the key questions you probably need to ask yourself when thinking about Data Factory deployments in the following slide.

Automated Testing

To complete our best practices for environments and deployments we need to consider testing. Given the nature of Data Factory as a cloud service and an orchestrator what should be tested often sparks a lot of debate. Are we testing the pipeline code itself, or what the pipeline has done in terms of outputs? Is the business logic in the pipeline or wrapped up in an external service that the pipeline is calling?

The other problem is that a pipeline will need to be published/deployed in your Data Factory instance before any external testing tools can execute it as a pipeline run/trigger. This then leads to a chicken/egg situation of wanting to test before publishing/deploying, but not being able to access your Data Factory components in an automated way via the debug area of the resource.

Currently my stance is simple:

  • Perform basic testing using the repository connected Data Factory debug area and development environment.
  • Deploy all your components to your Data Factory test instance. This could be in your wider test environment or a dedicated instance of ADF just for testing published pipelines.
  • Run everything end to end (if you can) and see what breaks.
  • Inspect activity inputs and outputs where possible and especially where expressions are influencing pipeline behaviour.

Another friend and ex-colleague Richard Swinbank has a great blog series on running these pipeline tests using an NUnit project in Visual Studio. Check it out here.

Naming Conventions

Hopefully we all understand the purpose of good naming conventions for any resource. However, when applied to Data Factory I believe this is even more important given the expected umbrella service status ADF normally has within a wider solution. Firstly, we need to be aware of the rules enforced by Microsoft for different components, here:

https://docs.microsoft.com/en-us/azure/data-factory/naming-rules

Unfortunately there are some inconsistencies to be aware of between components and what characters can/can’t be used. Once considered we can label things as we see fit. Ideally without being too cryptic and while still maintaining a degree of human readability.

For example, with a linked service to an Azure Functions App, we know from the icon and the linked service type what resource is being called, so we can omit that part from the name. As a general rule I think understanding why we have something is a better approach when naming it, rather than what it is; the ‘what’ can be inferred from its context.

Finally, when considering component names, be mindful that when injecting expressions into things, some parts of Data Factory don’t like spaces or special characters in names that could later break the JSON expression syntax.

Pipeline Hierarchies

I’ve blogged about the adoption of pipeline hierarchies as a pattern before (here) so I won’t go into too much detail again. Other than to say it’s a great technique for structuring any Data Factory, having used it on several different projects…

  • Grandparent
  • Parent
  • Child
  • Infant

Pipeline & Activity Descriptions

Every Pipeline and Activity within Data Factory has a non-mandatory description field. I want to encourage all of us to start making better use of it. When writing any other code we typically add comments to things to offer others an insight into our original thinking or the reasons behind doing something. I want to see these description fields used in ADF in the same way. A good naming convention gets us partly there with this understanding; now let’s enrich our Data Factories with descriptions too, again explaining why and how we did something. In this blog post I show you how to parse the JSON from a given Data Factory ARM template, extract the description values and make the service a little more self documenting.

Annotations

All components within Data Factory now support adding annotations. This builds on the description content by adding information about ‘what’ your pipeline is doing as well as ‘why’. If you treat annotations like tags on a YouTube video then they can be very helpful when searching for related resources, including looking around a source code repository (where ADF UI component folders aren’t shown).

Another key benefit of adding annotations is that they can be used for filtering within the Data Factory monitoring screen at the pipeline level, shown below:

Factory Component Folders

Folders and sub-folders are such a great way to organise our Data Factory components, we should all be using them to help ease of navigation. Be warned though, these folders are only used when working within the Data Factory portal UI. They are not reflected in the structure of our source code repository or in the monitoring view.

Adding components to folders is a very simple drag and drop exercise or can be done in bulk if you want to attack the underlying JSON directly. Subfolders get applied using a forward slash, just like other file paths.
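If you do edit the underlying JSON, the folder is just a small property on each component. For example, a pipeline placed in a ‘Sales/Ingestion’ subfolder (names purely illustrative) carries something like this in its definition:

{
    "name": "Import Sales Files",
    "properties": {
        "folder": {
            "name": "Sales/Ingestion"
        },
        "activities": []
    }
}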

Also be warned: if developers working in separate code branches move things that affect or change the same folders, you’ll get conflicts in the code just like with other resources. In some cases I’ve even seen duplicate folders created where the removal of a folder couldn’t naturally happen in a pull request. That said, I recommend organising your folders early on in the setup of your Data Factory. Typically for customers I would name folders according to the business processes they relate to.

Linked Service Security via Azure Key Vault

Azure Key Vault is now a core component of any solution; it should be in place holding the credentials for all our service interactions. In the case of Data Factory most Linked Service connections support the querying of values from Key Vault. Wherever possible we should be including this extra layer of security and allowing only Data Factory to retrieve secrets from Key Vault using its own Managed Service Identity (MSI).

If you aren’t familiar with this approach check out this Microsoft Docs page:

https://docs.microsoft.com/en-us/azure/data-factory/store-credentials-in-key-vault

Be aware that when working with custom activities in ADF using Key Vault is essential as the Azure Batch application can’t inherit credentials from the Data Factory linked service references.
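For reference, an Azure SQL Database linked service that retrieves its connection string from Key Vault looks something like the sketch below; the linked service names and secret name are illustrative only:

{
    "name": "LS_AzureSqlDb",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "LS_AzureKeyVault",
                    "type": "LinkedServiceReference"
                },
                "secretName": "SqlDbConnectionString"
            }
        }
    }
}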

Security Custom Roles

Given the above stance regarding Azure Key Vault, there is another critical point to consider: Data Factory has access to all the keys in Azure Key Vault. So who/what has access to Data Factory? Are we just moving the point of attack, and is Key Vault really adding an extra layer of security?

That said, the security model and roles available for Data Factory are very basic. At the Azure management plane level you can be an Owner or a Contributor, that’s it, and in both cases this will allow you access to anything in Key Vault using Data Factory as an authentication proxy.

To be really clear, using Data Factory in debug mode can return a Key Vault secret value really easily using a simple Web Activity request. See the Microsoft docs below on doing this:

https://docs.microsoft.com/en-us/azure/data-factory/how-to-use-azure-key-vault-secrets-pipeline-activities

Given this, we should consider adding some custom roles to our Azure Tenant/Subscriptions to better secure access to Data Factory. As a starting point I suggest creating the following custom roles:

  • Data Factory Pipeline Executor
  • Data Factory Reader

In both cases, users with access to the Data Factory instance can’t then get any keys out of Key Vault, only run/read what has already been created in our pipelines. You can find some sample JSON snippets to create these custom roles in my GitHub repository here.

For this you’ll currently require a premium Azure tenant. I also recommend using the Azure CLI to deploy the roles, as the PowerShell preview modules and Portal UI screens don’t work very well at the time of writing.
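To give a feel for the shape of these definitions, below is a sketch of what a ‘Data Factory Pipeline Executor’ role might contain. The action list and subscription scope are my illustrative assumptions, not the exact definitions from the repository linked above, and the file can be deployed with az role definition create:

{
    "Name": "Data Factory Pipeline Executor",
    "IsCustom": true,
    "Description": "Can view a Data Factory and trigger pipeline runs, but not edit the factory.",
    "Actions": [
        "Microsoft.DataFactory/factories/read",
        "Microsoft.DataFactory/factories/pipelines/read",
        "Microsoft.DataFactory/factories/pipelines/createrun/action",
        "Microsoft.DataFactory/factories/pipelineruns/read"
    ],
    "NotActions": [],
    "AssignableScopes": [
        "/subscriptions/<your subscription id>"
    ]
}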

Dynamic Linked Services

Reusing code is always a great time saver and means you often have a smaller footprint to update when changes are needed. With Data Factory linked services, adding dynamic content was previously only supported for a handful of popular connection types. However, we can now make all linked services dynamic (as required) using the feature and tick box called ‘Specify dynamic contents in JSON format‘. I’ve blogged about using this option in a separate post here.

Create your complete linked service definitions using this option and expose more parameters in your pipelines to complete the story for dynamic pipelines.
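For illustration, a fully parameterised linked service might look something like the sketch below, with the server and database names resolved at runtime; the names and connection string format here are assumptions rather than a definitive template:

{
    "name": "LS_GenericAzureSqlDb",
    "properties": {
        "type": "AzureSqlDatabase",
        "parameters": {
            "ServerName": { "type": "String" },
            "DatabaseName": { "type": "String" }
        },
        "typeProperties": {
            "connectionString": "Server=tcp:@{linkedService().ServerName}.database.windows.net,1433;Database=@{linkedService().DatabaseName};"
        }
    }
}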

Generic Datasets

Where design allows it I always try to simplify the number of datasets listed in a Data Factory. In version 1 of the resource separate hard coded datasets were required as the input and output for every stage in our processing pipelines. Thankfully those days are in the past. Now we can use a completely metadata driven dataset for dealing with a particular type of object against a linked service. For example, one dataset of all CSV files from Blob Storage and one dataset for all SQLDB tables.

At runtime the dynamic content underneath the datasets is created in full, so monitoring is not impacted by making datasets generic. If anything, debugging becomes easier because of the common/reusable code.

Where generic datasets are used I’d expect the following values to be passed as parameters. Typically from the pipeline, or resolved at runtime within the pipeline.

  • Location – the file path, table location or storage container.
  • Name – the file or table name.
  • Structure – the attributes available provided as an array at runtime.
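Pulling those parameters together, a generic delimited text dataset might look something like the sketch below; the dataset and linked service names are illustrative and the structure is left to be resolved at runtime:

{
    "name": "DS_GenericCsvBlob",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "LS_BlobStorage",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "Location": { "type": "String" },
            "Name": { "type": "String" }
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": { "value": "@dataset().Location", "type": "Expression" },
                "fileName": { "value": "@dataset().Name", "type": "Expression" }
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}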

To be clear, I wouldn’t go as far as making the linked services dynamic as well, unless we were really confident in our controls and credential handling. If you do, the linked service parameters will also need to be addressed, firstly at the dataset level, then in the pipeline activity. It really depends how far you want to go with the parameters.

Metadata Driven Processing

Building on our understanding of generic datasets and dynamic linked services, a good Data Factory should include (where possible) generic pipelines, driven from metadata to simplify (as a minimum) data ingestion operations. Typically I use an Azure SQLDB to house my metadata, with stored procedures that get called via Lookup activities to return everything a pipeline needs to know.

This metadata driven approach means deployments to Data Factory for new data sources are greatly reduced and only adding new values to a database table is required. The pipeline itself doesn’t need to be complicated. Copying CSV files from a local file server to Data Lake Storage could be done with just three activities, shown below.

Building on this I’ve since created a complete metadata driven processing framework for Data Factory that I call ‘procfwk‘. Check out the complete project documentation and GitHub repository if you’d like to adopt this as an open source solution.

Parallel Execution

Given the scalability of the Azure platform we should utilise that capability wherever possible. When working with Data Factory, the ‘ForEach’ activity is a really simple way to achieve the parallel execution of its inner operations. By default, the ForEach activity does not run sequentially; it will spawn 20 parallel threads and start them all at once. Great! It also supports a maximum batch count of 50 threads if you want to scale things out even further. I recommend taking advantage of this behaviour and wrapping all pipelines in ForEach activities where possible.

https://docs.microsoft.com/en-us/azure/data-factory/control-flow-for-each-activity

In case you weren’t aware, within the ForEach activity you need to use the syntax @{item().SomeArrayValue} to access the iteration values from the array passed as the ForEach input.
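As a sketch, the relevant parts of a ForEach activity definition look something like this; the pipeline parameter name and batch count are examples only:

{
    "name": "ForEach File",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": false,
        "batchCount": 50,
        "items": {
            "value": "@pipeline().parameters.FileList",
            "type": "Expression"
        },
        "activities": []
    }
}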

Hosted Integration Runtimes

Currently if we want Data Factory to access our on premises resources we need to use the Hosted Integration runtime (previously called the Data Management Gateway in v1 of the service). When doing so I suggest the following two things be taken into account as good practice:

  1. Add multiple nodes to the hosted IR connection to offer automatic failover and load balancing of uploads. Also, make sure you throttle the concurrency limits of your secondary nodes if the VMs don’t have the same resources as the primary node. More details here: https://docs.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime
  2. When using Express Route or other private connections make sure the VMs running the IR service are on the correct side of the network boundary. If you upgrade to Express Route later in the project and the Hosted IRs have been installed on local Windows boxes, they will probably need to be moved. Consider this in your future architecture and upgrade plans.

For larger deployments and Azure estates consider the wider landscape, with multiple IRs being used in a variety of ways. Data Factory might be a PaaS technology, but handling Hosted IRs requires some IaaS thinking and management.

Lastly, make sure in your non-functional requirements you capture potential IR job concurrency. If all job slots are full, queued activities will start appearing in your pipelines and really slow things down.

Azure Integration Runtimes

When turning our attention to the Azure flavour of the Integration Runtime, I typically like to update this by removing its freedom to auto-resolve to any Azure region. There can be many reasons for this, regulatory and otherwise, but as a starting point I simply don’t trust it not to charge me data egress costs when I already know which region the data is stored in.

Also, given the new Data Flow features of Data Factory, we need to consider updating the cluster sizes set against the Azure IR and maybe having multiple Azure IRs for different Data Flow workloads.

In both cases these options can easily be changed via the portal and a nice description added. Please make sure you tweak these things before deploying to production and align Data Flows to the correct clusters in the pipeline activities.

Finally, be aware that the IRs need to be set at the linked service level, although we can more naturally think of them as being the compute used in our Copy activity, for example.

Wider Platform Orchestration

In Azure we need to design for cost. I never pay my own Azure subscription bills, but even so, we should all feel accountable for wasting money. To that end, pipelines should be created with activities to control the scaling of our wider solution resources.

  • For a SQLDB, scale it up before processing and scale it down once finished.
  • For a SQLDW (Synapse SQL Pool), start the cluster before processing, maybe scale it out too. Then pause it after.
  • For Analysis service, resume the service to process the models and pause it after. Maybe, have a dedicated pipeline that pauses the service outside of office hours.
  • For Databricks, create linked services that use job clusters.
  • For Function Apps, consider using different App Service plans and make best use of the free consumption (compute) offered where possible.

You get the idea. Check with the bill payer, or pretend you’ll be getting the monthly invoice from Microsoft.

Building pipelines that don’t waste money in Azure Consumption costs is a practice that I want to make the technical standard, not best practice, just normal and expected in a world of ‘Pay-as-you-go’ compute.

I go into greater detail on the SQLDB example in a previous blog post, here.

Custom Error Handler Paths

Our Data Factory pipelines, just like our SSIS packages, deserve some custom logging and error paths that give operational teams the detail needed to fix failures. For me, these boiler plate handlers should be wrapped up as ‘Infant’ pipelines and accept a simple set of details:

  • Calling pipeline name.
  • Run ID of the failed execution.
  • Custom details coded into the process.

Everything else can be inferred or resolved by the error handler.
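As a quick sketch, the first two of those details map directly to system variables that the parent can pass into the handler’s parameters, and the error text of a specific upstream activity can be referenced too; the activity name ‘Copy Sales Files’ is purely illustrative:

@pipeline().Pipeline
@pipeline().RunId
@activity('Copy Sales Files').error.message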

Once established, we need to ensure that the processing routes within the parent pipeline are connected correctly: OR, not AND. All too often I see error paths not executed because the developer is expecting activity 1 AND activity 2 AND activity 3 to fail before the handler is called. Please don’t make this same mistake. Shown below.

Finally, if you would like a better way to access the activity error details within your handler pipeline I suggest using an Azure Function. In this blog post I show you how to do this and return the complete activity error messages.

Monitoring via Log Analytics

Like most Azure resources, we have the ability via the ‘Diagnostic Settings’ to output telemetry to Log Analytics. The out-of-box monitoring area within Data Factory is handy, but it doesn’t deal with any complexity. Having the metrics going to Log Analytics as well is a must have for all good factories. The experience is far richer and allows operational dashboards to be created for any/all Data Factories.

Screen snippet of the custom query builder shown below.

When you have multiple Data Factories going to the same Log Analytics instance, break out the Kusto queries to return useful information for all your orchestrators and pin the details to a shareable Azure Portal dashboard. For example:

ADFPipelineRun
| project TimeGenerated, Start, End, ['DataFactory'] = substring(ResourceId, 121, 100), Status, PipelineName, Parameters, ['RunDuration'] = datetime_diff('Minute', End, Start)
| where TimeGenerated > ago(1h) and Status !in ('InProgress', 'Queued')
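If activity level detail is useful too, a similar query can surface recent failures, assuming the resource specific ADFActivityRun table is enabled in your diagnostic settings:

ADFActivityRun
| where TimeGenerated > ago(24h) and Status == 'Failed'
| project TimeGenerated, PipelineName, ActivityName, ActivityType, Status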

Service Limitations

Please be aware that Azure Data Factory does have limitations. Both internally to the resource and across a given Azure Subscription. When implementing any solution and set of environments using Data Factory please be aware of these limits. To raise this awareness I created a separate blog post about it here including the latest list of conditions.

The limit I often encounter is that you can only have 40 activities per pipeline. Of course, with metadata driven things this is easy to overcome, or you could refactor pipelines into parents and children as already mentioned above. As a best practice, just be aware and be careful. Maybe also check with Microsoft which are hard limits and which can easily be adjusted via a support ticket.

Timeout & Retry

Almost every Activity within Data Factory has the following three settings (timeout, retry, retry interval) which fall under the policy section if viewing the underlying JSON:

The screen shot above also shows the default values. Look carefully.
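For reference, the policy block sits on each activity in the JSON and looks something like the sketch below, shown with what I believe the default values to be at the time of writing:

"policy": {
    "timeout": "7.00:00:00",
    "retry": 0,
    "retryIntervalInSeconds": 30,
    "secureOutput": false,
    "secureInput": false
}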

Awareness needs to be raised here that these default values cannot and should not be left in place when deploying Data Factory to production. A default timeout value of 7 days is huge, and most will read this value assuming hours, not days! If a Copy activity stalls or gets stuck you’ll be waiting a very long time for the pipeline failure alert to come in. Hopefully the problem and solution here are obvious: change this at the point of deployment, with different values per environment and per activity operation. Don’t wait 7 days for a failure to be raised.

Credit where it’s due: I hadn’t considered timeouts to be a problem in Data Factory until very recently, when the Altius managed services team made a strong case for them to be updated across every Data Factory instance.

Using Templates

Pipeline templates are, I think, a fairly underused feature within Data Factory. They can be really powerful when you need to reuse a set of activities that only have to be provided with new linked service details. And if nothing else, getting Data Factory to create SVGs of your pipelines is really handy for documentation too. I’ve even used templates in the past to snapshot pipelines when source code versioning wasn’t available. A total hack, but it worked well.

Like the other components in Data Factory, template files are stored as JSON within our code repository. Each template has a manifest.json file that contains the vector graphic and details about the pipeline that has been captured. Give them a try, people.

In addition to creating your own, Data Factory also includes a growing set of common pipeline patterns for rapid development with various activities and data flows.

Documentation

Do I really need to call this out as a best practice?? Every good Data Factory should be documented. As a minimum we need somewhere to capture the business process dependencies of our Data Factory pipelines. We can’t see this easily when looking at a portal of folders and triggers, trying to work out what goes first and what sits upstream of a new process. I use Visio a lot and this seems to be the perfect place to create (what I’m going to call) our Data Factory Pipeline Maps: maps of how all our orchestration hangs together. Furthermore, if we created this in Data Factory the layout of the child pipelines can’t be saved, so it’s much easier to visualise in Visio. I recommend we all start creating something similar to the example below.


That’s all folks. If you have done all of the above when implementing Azure Data Factory then I salute you 🙂

Many thanks for reading. Comments and thoughts very welcome. Defining best practices is always hard and I’m more than happy to go first and get them wrong a few times while we figure it out together 🙂

 

28 thoughts on “Best Practices for Implementing Azure Data Factory”

    1. That’s an interesting one… are you testing the ADF pipeline? Or are you actually testing whatever service the ADF pipeline has invoked? If the former and you don’t care about the service called what are you even testing for?


      1. We are currently not doing any pipeline tests, just looking at how we may do it. We would mainly be interested in integration tests with the proper underlying services being called, but I guess we could also parameterize the pipelines sufficiently that we could use mock services and only test the pipeline logic, as a sort of unit test. The thinking so far is to have a separate folder in ADF for “test pipelines” that invoke other pipelines and check their output, then script the execution of the test pipelines in a CI build.

        BTW, thanks for the very nice writeup in the original article!


  1. Excellent post and great timing as I’m currently getting stuck into ADF. We are following most of the above, but will definitely be changing a few bits after reading your post. One point we are unsure of is whether we should be setting up a Data Factory per business process or one mega factory and use the folders to separate the objects.

    Current thoughts are to create a factory for each distinct area (so one for Data Warehouse, one for External File delivery etc.) then one extra factory just containing the integration runtimes to our on-prem data that are shared to each factory when needed. From a CI pipeline point of view then, with separate factories one release doesn’t hold up the other. We don’t see us having more than about 10 distinct areas overall.

    Be great to hear your thoughts on best practice for this point. Thanks again for the great post.


  2. Hi Paul, great article! As someone relatively new to Data Factory, but familiar with other components within Azure and with previous lengthy experience with SSIS, I wanted to ask a couple of questions:

    1. Would you (and if so, when would you) ever recommend splitting into multiple Data Factories as opposed to having multiple pipelines within the same Data Factory? In my head I’m currently seeing a Data Factory as analogous to a project within SSIS. From a deployment point of view, in SSIS having a “too big” project started making deployments and testing a little unwieldy with the volume of things being deployed; such as having to ensure that certain jobs were not running during the deployment and so on. Would this be the case with deploying the ARM template for DF? I’m also thinking of the security aspects, as I’m assuming RBAC is granted at the factory level?

    2. Does DF suffer from the same sort of metadata issues that SSIS did? E.g. when an underlying table had a column changed that was not used in a data flow, you still needed to refresh the metadata within SSIS even though effectively no changes were being made.

    3. From a code review/pull request perspective, how easy is it to look at changes within the ARM template, or are they sometimes numerous and unintelligible as with SSIS and require you to look at it in the UI? For SSIS collaborative working has always been a bit of a pain due to the shared xml structure.

    4. Testing was mentioned above; what frameworks (if any) are you currently using?

    Thanks in advance if you get time to answer any of that, turned into more text than I was anticipating!


    1. Hi Matthew, thanks for the comments, maybe let’s have a chat about your points rather than me replying here. Drop me an email and we can arrange something. Cheers Paul


      1. It would definitely be good to hear an opinion on question number 1. I believe a lot of developers of ADF will struggle with this best practice. I can see a Data Factory becoming hard to maintain if you have multiple pipelines for different processes in the same Data Factory. It sounds like you can organize by using folders, but for maintainability it could get difficult pretty quickly.


      2. Hi All

        Sorry for the delayed reply.

        To point 1 in Matthew’s comment above.

        There are a few things to think about here:

        Firstly, I would consider using multiple Data Factories if we wanted to separate business processes. E.g. Finance, Sales, HR. Another situation might be for operations and having resources in multiple Azure subscriptions for the purpose of easier inter-departmental charging on Azure consumption costs. Finally, data regulations could be a factor. For example, resources are restricted to a particular Azure region. You can overcome this using region-specific Azure Integration Runtimes with ADF, but that adds a management overhead to deployments of IRs and aligning specific (copy) Activities to those dedicated IRs. Therefore, separate factories might make more sense.

        So, in summary, 3 reasons to do it… Business processes. Azure charging. Regulatory data restrictions.

        What I would not do is separate Data Factories for the deployment reasons (like big SSIS projects). That said, it is also worth pointing out that I rarely use the ADF ARM templates to deploy Data Factory. Typically, we use the PowerShell cmdlets and use the JSON files (from your default code branch, not ‘adf_publish’) as definitions to feed the PowerShell cmdlets at an ADF component level. PowerShell has granular cmdlets for each ADF component (Datasets, Linked Services, Pipelines etc). This gives much more control and means releases can be much smaller.

        Final thoughts, around security and reusing Linked Services. ADF does not currently offer any lower-level granular security roles beyond the existing Azure management plane. Therefore, once you have access to ADF, you have access to all its Linked Service connections. Maybe this is a reason to separate resources, in line with my first point about business processes.

        Hope this helps inform your decisions.

        Cheers
        Paul


  3. Hi Paul, great article! I’m convinced by the GitHub integration only on the dev environment (and by many of the other good practices you describe). I was wondering though: does that mean the triggers have to be scheduled the same way in all the environments?


    1. Hi, yes great point, in this situation I would do Data Factory deployments using PowerShell which gives you much more control over things like triggers and pipeline parameters that aren’t exposed in the ARM template. Rather than using a complete ARM template, use each JSON file added to the repo ‘master’ branch as the definition file for the respective PowerShell cmdlet.

      For example:
      …\Git\pipeline\Pipeline1.json. Then in PowerShell use Set-AzDataFactoryV2Pipeline -DefinitionFile $componentPath.

      For a trigger, you will also need to Stop it before doing the deployment. So, something like…

      # Assuming $resourceGroup, $dataFactory, $triggerName and $componentPath are already set.
      $currentTrigger = Get-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroup -DataFactoryName $dataFactory -Name $triggerName -ErrorAction SilentlyContinue
      if ($null -ne $currentTrigger) {
          Stop-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroup -DataFactoryName $dataFactory -Name $triggerName -Force
      }
      Set-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroup -DataFactoryName $dataFactory -Name $triggerName -DefinitionFile $componentPath -Force
      Start-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroup -DataFactoryName $dataFactory -Name $triggerName -Force

      Hope this helps.


  4. Nice post, thank you. I started working with ADF 10 months ago and was wondering where to look for solutions when I get stuck. Finally I’ve found a way, thank you; I’ve been hooked on all your articles since yesterday 🙂
    Currently I am working on a new POC where I’m trying to pull data from an API or website. In one scenario, I need to pull the data from Excel (which is on the web) and load it into Synapse.
    I did convert that into separate CSV files for every sheet to process further. But while doing that I found the data in the CSV is junk after certain rows, which is causing the following error.
    Error
    {
    “errorCode”: “BadRequest”,
    “message”: “Operation on target CopyDataFromBlobToADLS failed: Failure happened on ‘Sink’ side. ErrorCode=ParquetJavaInvocationException,’Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=An error occurred when invoking java, message: java.lang.IllegalArgumentException:field ended by ‘;’: expected ‘;’ but got ‘Drain’ at line 0: message adms_schema { optional binary Country (UTF8); optional binary Year (UTF8); optional binary Rank (UTF8); optional binary Total (UTF8); optional binary SecurityApparatus (UTF8); optional binary FactionalizedElites (UTF8); optional binary GroupGrievance (UTF8); optional binary Economy (UTF8); optional binary EconomicInequality (UTF8); optional binary HumanFlightandBrain Drain\ntotal entry:10\r\norg.apache.parquet.schema.MessageTypeParser.check(MessageTypeParser.java:215)\r\norg.apache.parquet.schema.MessageTypeParser.addPrimitiveType(MessageTypeParser.java:188)\r\norg.apache.parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:112)\r\norg.apache.parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:100)\r\norg.apache.parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:93)\r\norg.apache.parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:83)\r\ncom.microsoft.datatransfer.bridge.parquet.ParquetWriterBuilderBridge.getSchema(ParquetWriterBuilderBridge.java:188)\r\ncom.microsoft.datatransfer.bridge.parquet.ParquetWriterBuilderBridge.build(ParquetWriterBuilderBridge.java:160)\r\ncom.microsoft.datatransfer.bridge.parquet.ParquetWriterBridge.open(ParquetWriterBridge.java:13)\r\ncom.microsoft.datatransfer.bridge.parquet.ParquetFileBridge.createWriter(ParquetFileBridge.java:27)\r\n.,Source=Microsoft.DataTransfer.Richfile.ParquetTransferPlugin,”Type=Microsoft.DataTransfer.Richfile.JniExt.JavaBridgeException,Message=,Source=Microsoft.DataTransfer.Richfile.HiveOrcBridge,’”,
    “failureType”: “UserError”,
    “target”: “PL_CopyFromBlobToAdls”,
    “details”: “”
    }

    Any idea how to cleanse such data before processing further? Also, what’s the better way to ingest such data, ADF or Databricks? Please advise. Thank you


  5. Great article Paul. I feel it missed out on some very important gotchas: Specifically that hosted runtimes (and linked services for that matter) should not have environment specific names. That is, I come across ADF’s where the dev runtime is called ‘devruntime’ and the test one is called ‘testruntime’ and the result is it’s extra difficult to migrate code. Another gotcha is mixing shared and non shared integration runtimes. Again it’s extra difficult to migrate code from a shared runtime to a non shared runtime (or vice versa). Making this mistake causes a lot of extra needless work reconfiguring runtimes and packages. It seems obvious to me that non top-level resources should not have environment specific names.


  6. Hi @mrpaulandrew, thanks a lot for this blog. I am trying to implement an Azure DevOps pipeline for ADF following the adf_publish approach (sorry for not choosing the other PS approach, as I find this more standard 🙂 ). I am able to parameterise and deploy linked services, but I also have parameters for the pipelines. How do I achieve parameterisation of those from DevOps? I’ve been trying to find a solution for this for more than a week. Please help, thanks in advance!


  7. Hello and thanks for sharing!
    What would be the recommended way to share common pipeline templates for multiple ADFs to use?
    Is export/import the only option?
    Thanks


    1. Good question. My preferred approach is to have a common code base in Git/GitHub. Then deploy the generic pipeline definitions to multiple target data factory instances using PowerShell cmdlets. At deployment time also override any localised configuration the pipeline needs.

