In this four-part blog series I want to share my approach to delivering a metadata-driven processing framework in Azure Data Factory. This is very much version 1 of what is possible, with plenty of scope to build on it and deliver a hardened solution for our platform orchestrator. For this blog series I’m planning to break things down as follows:
- Design, concepts, service coupling, caveats, problems. Post link.
- Database build and metadata. Post link.
- Data Factory build. Post link.
- Execution, conclusions, enhancements. Post link.
Blog Supporting Content in my GitHub repository:
Welcome back, friends, to part 2 of this four-part blog series. In this post we are going to deliver on some of the design points we covered in part 1 by building the database to house our processing framework metadata.
Let’s start with a shiny new Azure SQL Database and schema. This can easily be scaled up as our calls from Data Factory increase and, ultimately, as the solution we are using the framework for grows.
I’ve abbreviated Processing Framework to procfwk, but whatever you prefer is fine.
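As a minimal sketch, the schema can be created with a single statement once the database exists (the database itself would normally be provisioned via the Azure portal or an ARM template):

```sql
-- Create the schema used to namespace all framework objects.
CREATE SCHEMA [procfwk] AUTHORIZATION [dbo];
```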
Next, our relational tables. Given the simple requirements, I’ve created just five tables to house the metadata: three for our pipeline and stage information and two for execution/logging runtime information. The database diagram was very difficult to draw 🙂
For each of the tables here is my narrative for their role within the framework:
- Pipeline Processes – This is the core table of metadata which will be used to drive all other behaviour in the framework. The Stage ID and Pipeline ID are critical for the execution. The table will be queried for both these sets of details, which will ultimately be provided to Data Factory as arrays. Firstly, for the stages of processing, to provide the sequential dependency chain. Then secondly, for the pipeline details, to provide the parallel execution of all child items. Finally, processes can be disabled/enabled at the pipeline level.
- Pipeline Parameters – As we are executing existing child pipelines within Data Factory these could contain custom parameters that need to be passed when called by the Azure Function. This table will house those key value pairs which will correspond to the actual pipeline parameters in Data Factory.
- Processing Stage Details – This table offers additional narrative on the processing stages as well as allowing a given stage to be disabled/enabled.
The remaining two tables which are excluded from any foreign key constraints relate to runtime execution and logging as hopefully the names suggest.
- Current Execution – At runtime this table will be populated with a static set of all stage and pipeline metadata. It will be isolated in this table for the duration of a processing run. The composite primary key ensures that we have integrity in the original metadata. However, at this point there isn’t anything more defensive in place to handle situations where it isn’t.
- Execution Log – This table is a long term log of all executions for reference and for performance checking of pipeline durations.
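To make the narrative above concrete, here is an illustrative slice of the DDL. The exact column names and types are my assumptions based on the descriptions; the actual scripts are in the GitHub repository linked at the top.

```sql
-- Core metadata table: one row per child pipeline, grouped into stages.
CREATE TABLE [procfwk].[PipelineProcesses]
(
    [PipelineId]   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    [PipelineName] NVARCHAR(200) NOT NULL,
    [StageId]      INT NOT NULL, -- FK to the Processing Stage Details table
    [Enabled]      BIT NOT NULL DEFAULT 1
);

-- Runtime snapshot table, deliberately excluded from FK constraints.
CREATE TABLE [procfwk].[CurrentExecution]
(
    [LocalExecutionId] UNIQUEIDENTIFIER NOT NULL,
    [StageId]          INT NOT NULL,
    [PipelineId]       INT NOT NULL,
    [PipelineName]     NVARCHAR(200) NOT NULL,
    [PipelineStatus]   NVARCHAR(50) NULL,
    [StartDateTime]    DATETIME NULL,
    [EndDateTime]      DATETIME NULL,
    -- Composite key preserves the integrity of the copied metadata.
    CONSTRAINT [PK_CurrentExecution]
        PRIMARY KEY ([LocalExecutionId], [StageId], [PipelineId])
);
```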
For each of the stored procedures here is my narrative for their role within the framework given the tables above:
- CreateNewExecution – To populate the Current Execution table with the metadata for the next processing run. This will only capture enabled pipelines and enabled stages. The procedure will also generate an execution GUID that will be used for all following log updates relating to the current processing run.
- UpdateExecutionLog – To archive off the last completed execution run into the long term logging table. Separating this means the large logging table doesn’t hinder the performance of the current execution’s calls for metadata and allows pipeline durations to be analysed in an offline way. Consider it as the metadata execution data warehouse table with a run snapshot grain!
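A sketch of the execution bootstrap, assuming the hypothetical table and column names shown earlier, might look like this:

```sql
-- Populate Current Execution with only enabled pipelines in enabled stages,
-- tagged with a fresh execution GUID for all subsequent log updates.
CREATE PROCEDURE [procfwk].[CreateNewExecution]
AS
BEGIN
    DECLARE @LocalExecutionId UNIQUEIDENTIFIER = NEWID();

    INSERT INTO [procfwk].[CurrentExecution]
        ([LocalExecutionId], [StageId], [PipelineId], [PipelineName])
    SELECT
        @LocalExecutionId,
        p.[StageId],
        p.[PipelineId],
        p.[PipelineName]
    FROM [procfwk].[PipelineProcesses] AS p
    INNER JOIN [procfwk].[ProcessingStageDetails] AS s
        ON p.[StageId] = s.[StageId]
    WHERE p.[Enabled] = 1
      AND s.[Enabled] = 1;
END;
```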
The next group of procedures will be used at runtime by Data Factory to get all the required processing details. I’ve listed them according to their natural chain of events at runtime.
- GetProcessStages – To return a distinct list of Stage ID values that will be executed sequentially by Data Factory. At runtime this procedure should only be called once.
- GetPipelinesInStage – To return a list of all Pipeline IDs and Pipeline Names within a given execution stage. At runtime this procedure should only be called once per stage.
- GetPipelineParameters – To return a JSON string for the pipeline parameters that need to be passed. This is a slightly crude way to create JSON as a string that will eventually be passed onto an Azure Function as the request body (SQL snippet below). At runtime this procedure should only be called once per pipeline.
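For the crude JSON-as-string approach mentioned above, a sketch along these lines would work, assuming a Pipeline Parameters table holding ParameterName/ParameterValue pairs per pipeline:

```sql
-- Build a JSON request body by string concatenation, one key-value
-- pair per parameter row for the requested pipeline.
CREATE PROCEDURE [procfwk].[GetPipelineParameters]
    @PipelineId INT
AS
BEGIN
    DECLARE @Json NVARCHAR(MAX) = '';

    SELECT @Json = @Json + '"' + [ParameterName] + '": "'
                         + [ParameterValue] + '", '
    FROM [procfwk].[PipelineParameters]
    WHERE [PipelineId] = @PipelineId;

    -- Trim the trailing comma (LEN ignores trailing spaces) and wrap
    -- in braces so the result can be used directly as a request body.
    SELECT '{ ' + LEFT(@Json, LEN(@Json) - 1) + ' }' AS [Params]
    WHERE LEN(@Json) > 0;
END;
```

On newer SQL versions the same result could be achieved more safely with `FOR JSON PATH`, but the string concatenation keeps the output a plain NVARCHAR that Data Factory can pass straight to the Azure Function.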
The final group of stored procedures are, I hope, fairly self-explanatory; each will include an UPDATE statement against the Current Execution table. I’ve also listed them in their natural order of use.
- SetLogStageStart – To set the start date and time of the pipelines in the stage being executed.
- SetLogPipelineRunning – To set the status of the pipelines in the stage to running.
- SetLogPipelineSuccess – To set the status of a pipeline in the stage to success with a date and time.
- SetLogPipelineFailed – To set the status of a pipeline in the stage to failed with a date and time.
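Each of these follows the same shape; as one hedged example, the success case might look like the following, again assuming the Current Execution columns sketched earlier:

```sql
-- Mark a single pipeline in the current run as successful,
-- stamping the completion date and time.
CREATE PROCEDURE [procfwk].[SetLogPipelineSuccess]
    @ExecutionId UNIQUEIDENTIFIER,
    @StageId     INT,
    @PipelineId  INT
AS
BEGIN
    UPDATE [procfwk].[CurrentExecution]
    SET [PipelineStatus] = 'Success',
        [EndDateTime]    = GETUTCDATE()
    WHERE [LocalExecutionId] = @ExecutionId
      AND [StageId]          = @StageId
      AND [PipelineId]       = @PipelineId;
END;
```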
Finally, I’ve created some sample metadata to get us started before the next part of the build. This data along with all the database code is available in GitHub, link at the top.
This concludes the second part in this blog series. To recap:
- Database created.
- Tables created.
- Stored procedures created.
- Sample metadata added.
In part 3 of Creating a Simple Staged Metadata Driven Processing Framework for Azure Data Factory Pipelines we’ll build the Data Factory pipelines and connect up all the database procedures to drive the orchestration.
Many thanks for reading.
8 thoughts on “Creating a Simple Staged Metadata Driven Processing Framework for Azure Data Factory Pipelines – Part 2 of 4”
Not making the Execute Pipeline Activity dynamic is one of the most head-scratching things. Major shortcoming in ADF.
Not really. If something is really easy to work around it gets a lower priority on the Microsoft backlog.
I do tend to agree with Steven. Whilst there are workarounds as you state, Paul, and that therefore defines the backlog, it’s akin to having an SSIS package that can’t dynamically call another package. It is head-scratching and doesn’t make architectural sense given Microsoft’s mindset with the ADF implementation.