In this four-part blog series I want to share my approach to delivering a metadata-driven processing framework in Azure Data Factory. This is very much version 1 of what is possible, and a starting point we can build on to deliver a hardened solution for our platform orchestrator. For this blog series I’m planning to break things down as follows:
- Design, concepts, service coupling, caveats, problems. Post link.
- Database build and metadata. Post link.
- Data Factory build. Post link.
- Execution, conclusions, enhancements. Post link.
Blog Supporting Content in my GitHub repository:
The concept of having a processing framework to manage our Data Platform solutions isn’t a new one. However, over time, changes in the technology we use mean the way we now deliver this orchestration has to change as well, especially in Azure. On that basis, and using my favourite Azure orchestration service, Azure Data Factory (ADF), I’ve created an alpha metadata-driven framework that could be used to execute all our platform processes. Furthermore, at various community events I’ve talked about bootstrapping solutions with Azure Data Factory, so now, as a technical exercise, I’ve rolled my own simple processing framework. Mainly, to understand how easily we can build it with the latest cloud tools, and to fully exploit just how dynamic you can make a set of generic pipelines.
Leading up to the build of such a framework, a few key design decisions had to be made. I think it’s worth sharing these to begin with, for context and because, as the title suggests, this is a simple staged approach to processing. Here’s why:
- Question: How should the framework call the execution of the work at the lowest level? Or, to ask the question in more practical terms for Azure Data Factory: should we define a particular Data Factory Activity type in our metadata, then dynamically switch between different activities for the child pipeline calls? I did consider using the new Data Factory Switch activity for exactly this, but then decided against it (for now).
My Answer: To keep things simple, the lowest-level executor will be the call to another Data Factory pipeline, effectively using the Execute Pipeline Activity. Doing this means the specifics of the actual work can be wrapped up using a tailored activity type with tailored values (e.g. a Databricks Notebook Activity referencing a particular Linked Service and Workspace). Hitting a child pipeline as the lowest level of execution in the framework caller offers an easier abstraction over the actual work being done.
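For context, here’s a minimal sketch of what this looks like in ADF’s underlying pipeline JSON; the activity and pipeline names are placeholders, not from the framework:

```json
{
    "name": "Execute Child Pipeline",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": {
            "referenceName": "ChildPipelineName",
            "type": "PipelineReference"
        },
        "waitOnCompletion": true
    }
}
```

Note the `referenceName` value here is a fixed string, which leads directly into the next question.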
- Question: Given our understanding in point 1 of the child-level call from the framework, how are we technically going to manage this? The Execute Pipeline Activity in ADF does not allow dynamic values; the child pipeline referenced must be hard coded. Also, what if our solution orchestration needs to span multiple Data Factory resources?
My Answer: Use an Azure Function to execute any Data Factory pipeline in any Data Factory. You may recall I created this in my last blog post 😉
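To illustrate why a Function can do what the Execute Pipeline Activity can’t, here’s a sketch of the management REST endpoint such a Function could wrap (ADF’s “Pipelines - Create Run” API). Every parameter name below is a placeholder assumption, and the real call would also need an Azure AD bearer token:

```python
# Build the ADF management REST endpoint used to trigger a pipeline run.
# All identifiers passed in are placeholders, not real resources.
API_VERSION = "2018-06-01"

def build_create_run_url(subscription_id: str, resource_group: str,
                         factory_name: str, pipeline_name: str) -> str:
    """Return the 'Pipelines - Create Run' URL for any factory/pipeline."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.DataFactory"
        f"/factories/{factory_name}"
        f"/pipelines/{pipeline_name}"
        f"/createRun?api-version={API_VERSION}"
    )

# Because subscription, resource group and factory are all parameters,
# one Function can target any pipeline in any Data Factory.
print(build_create_run_url("<sub-id>", "my-rg", "MyFactory", "ChildPipeline"))
```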
- Question: What should be used to house our orchestration metadata? Would a set of JSON file artefacts be sufficient? Or, at the other end of the complexity spectrum, maybe we use the Gremlin API within Azure Cosmos DB to create a node-and-edge-based solution for connecting dependencies between processes?
My Answer: I settled on an Azure SQLDB as a middle ground between flat files and Cosmos DB. This also made development easier, as I do love a bit of T-SQL.
- Question: Building on point 3, how could/should our upstream and downstream dependencies be handled and connected without resulting in a single-threaded, sequential set of processes? And how can we make the metadata simple enough to edit without requiring a complete overhaul of many-to-many connections?
My Answer: I’ve introduced the concept of execution stages, with child processes within each stage. As stated in point 1, the lowest level of execution will be done using ADF pipelines. Doing this means stages can execute sequentially to support dependencies, while child processes within a stage can execute in parallel, allowing the lowest-level executions to scale out. Child processes should therefore have no inter-dependencies.
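The stage/child model above can be sketched in a few lines of Python. The metadata shape and pipeline names here are assumptions for illustration only, and `run_pipeline` stands in for the real call out to the Azure Function:

```python
# Stages run sequentially; child pipelines within a stage run in parallel.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical metadata: stage number -> child pipelines (no inter-dependencies).
metadata = {
    1: ["Stage Raw Data"],
    2: ["Transform Customers", "Transform Orders"],
    3: ["Load Warehouse"],
}

def run_pipeline(name: str) -> str:
    # Placeholder for the Azure Function call that triggers the ADF pipeline.
    return f"{name}: Succeeded"

results = []
for stage in sorted(metadata):          # stages execute in order
    with ThreadPoolExecutor() as pool:  # children scale out in parallel
        results.extend(pool.map(run_pipeline, metadata[stage]))
print(results)
```

Because each stage only starts after the previous one completes, dependencies are expressed simply by which stage a process sits in, rather than by many-to-many links between individual processes.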
- Question: Should we be concerned about tightly coupling Azure Data Factory to an Azure SQLDB for metadata and an Azure Function App for child process execution? Is this good practice?
My Answer: Yes, I’m concerned. This isn’t great, I’m fully aware. We could add other layers of abstraction into the framework, but for now keeping things simple is my preference. Also, in the medium term I really hope Microsoft will allow the Execute Pipeline Activity to be dynamic, avoiding the need to call from ADF to a Function and back.
- Question: How should the metadata be used, toggled and controlled at runtime?
My Answer: To offer a small amount of decoupling in the overall execution, I’m copying how the SQL Server Agent behaves for a given run of jobs. If you aren’t familiar with the SQL Agent: basically, I’ll take a static copy of the processing metadata and use it as a fixed set of instructions at runtime. This means any changes made to the metadata during processing won’t compromise the execution run in progress. Finally, to address the control part of this question, bit fields will be added to the metadata at both the stage level and the child process level, meaning anything can easily be disabled before execution.
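The snapshot-and-toggle behaviour described above can be sketched as follows. The column names and rows are assumptions for illustration, not the framework’s actual schema:

```python
# SQL Agent style behaviour: snapshot the metadata before a run and
# respect the enabled bit fields at stage and child process level.
import copy

# Hypothetical metadata rows with stage/process enabled flags.
metadata = [
    {"stage": 1, "pipeline": "Stage Raw Data",
     "stage_enabled": True, "pipeline_enabled": True},
    {"stage": 2, "pipeline": "Transform Orders",
     "stage_enabled": True, "pipeline_enabled": False},  # disabled pre-run
]

# Static copy: later edits to `metadata` can't affect the run in flight,
# and disabled stages/processes are excluded up front.
run_snapshot = [
    copy.deepcopy(row) for row in metadata
    if row["stage_enabled"] and row["pipeline_enabled"]
]

metadata[0]["pipeline"] = "Renamed Mid Run"  # a change made during processing
print([row["pipeline"] for row in run_snapshot])  # snapshot is unaffected
```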
I think that is a good point to conclude the first part of this blog series. To recap:
- Idea established.
- Design work done.
- Concepts defined.
- Low level function executor already built.
In part 2 of Creating a Simple Staged Metadata Driven Processing Framework for Azure Data Factory Pipelines we’ll build the database schema required to drive the orchestration framework.
Many thanks for reading.