Code Project Overview
This open source code project delivers a simple metadata driven processing framework for Azure Data Factory (ADF). The framework is made possible by coupling ADF with an Azure SQL Database that houses execution stage and pipeline information that is later called using an Azure Functions App. The parent/child metadata structure firstly allows stages of dependencies to be executed in sequence. Secondly, it allows all pipelines within a stage to be executed in parallel, offering scaled out control flows where no inter-dependencies exist.
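The stage/pipeline behaviour described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the framework's implementation: the stage numbers, pipeline names and the `run_pipeline` stand-in are all hypothetical, and a thread pool is used in place of real Data Factory pipeline calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical metadata: stage number -> Worker pipelines in that stage.
STAGES: Dict[int, List[str]] = {
    1: ["Extract A", "Extract B"],
    2: ["Transform"],
    3: ["Load X", "Load Y"],
}


def run_pipeline(name: str) -> str:
    """Stand-in for triggering a Worker pipeline and waiting for its result."""
    return f"{name}: Succeeded"


def run_framework(stages: Dict[int, List[str]]) -> List[List[str]]:
    """Run stages in sequence; run the pipelines inside each stage in parallel."""
    results = []
    with ThreadPoolExecutor() as pool:
        for stage_id in sorted(stages):
            # list(pool.map(...)) waits for every pipeline in this stage
            # before the loop moves on to the next stage.
            results.append(list(pool.map(run_pipeline, stages[stage_id])))
    return results


if __name__ == "__main__":
    for stage_results in run_framework(STAGES):
        print(stage_results)
```

The point of the shape is that the only ordering guarantee sits between stages; nothing inside a stage depends on anything else in that stage.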
The framework is designed to integrate with any existing Data Factory solution by making the lowest level executor a stand alone Worker pipeline that is wrapped in a higher level of controlled (sequential) dependencies. This level of abstraction means operationally nothing about the monitoring of orchestration processes is hidden in multiple levels of dynamic activity calls. Instead, everything from the processing pipeline doing the work (the Worker) can be inspected using out-of-the-box ADF features.
This framework can also be used in any Azure Tenant and allows the creation of complex control flows across multiple Data Factory resources by connecting Service Principal details through metadata to targeted Subscriptions > Resource Groups > Data Factories and Pipelines. This offers very granular administration over data processing components in a given environment.
Framework Key Features
- Granular metadata control.
- Metadata integrity checking.
- Global properties.
- Complete pipeline dependency chains.
- Execution restart-ability.
- Parallel execution.
- Full execution and error logs.
- Operational dashboards.
- Low cost orchestration.
- Disconnection between framework and Worker pipelines.
- Cross Data Factory control flows.
- Pipeline parameter support.
- Simple troubleshooting.
- Easy deployment.
- Email alerting.
- Key Vault Integration.
Thank you for visiting. Details on the latest framework release can be found below.
Version 1.8.2 of ADF.procfwk is ready!
A slightly dry set of release notes this time, relating to security! Such fun! 🙂
Way back in v1.1 of the processing framework I added SPN details to the database metadata. This was and still is to enable the Azure Functions to authenticate against Data Factory when executing, checking and returning error information for our Worker pipelines.
Storing these SPN values in a database table (even though encrypted) never really felt great, but at the time it was the best option. Reflecting again on the original design, and considering the credentials have to be stored somewhere to offer us the flexibility to hit and authenticate against any Data Factory in the subscription while allowing the use of different SPNs for different pipelines, this was the right choice.
As a refresher, the following visual from the v1.1 release demonstrates the SPN values being retrieved from the metadata and passed to the Functions via the request body. In addition to storing the values in the database, it means that the client secrets are also available to be viewed via the Data Factory monitoring screens if you dig into the activity input values. So, again, this doesn’t feel great.
The key to this new capability is that Azure Function Apps now support Managed Service Identities (MSI). This means a Function App can authenticate itself via an access policy in Azure Key Vault. Then SPN values can be retrieved as secrets from Key Vault as part of the Function execution. Microsoft docs link:
To clarify, previously, I didn’t do this because without the MSI support the Function would still have needed its own SPN credentials to first authenticate against Key Vault before returning the SPN credentials for Data Factory. In this case, the “starting point” SPN had to be stored somewhere, so at the time I chose the database.
Given the above, what we can now do, instead of storing the actual SPN details in the database, is just store the Key Vault URLs that offer an address to the actual SPN details. Then, via its own MSI, the Function gets the details from Key Vault using the URLs.
The authentication flow, if you choose this option, would then have the extra step of the Function hitting Key Vault at runtime to resolve the secret URL to an actual value, rather than having the actual value already from the database.
As mentioned, hopefully this is a fairly subtle change and an easy one to implement. By default, the previous SPN handling behaviour will remain unchanged. This is another completely optional feature. Please also be aware, no changes are required in Data Factory to implement this option. Data Factory doesn’t care if it’s passing the SPN Id to the Function activities or the Key Vault URL for the SPN Id.
Final thought here, before going into the development detail: it would be even nicer to take this one step further and just use the Function MSI to authenticate against Data Factory directly. However, this isn’t currently supported and would also reduce the flexibility in the metadata allowing different SPNs for different pipelines. Therefore, SPN values still need to be passed around for now to authenticate against ADF.
A new framework property has been added to support the alternative authentication method. This is called ‘SPNHandlingMethod’ and expects one of the following values:
- StoreInDatabase = this represents the existing method of handling SPN details. The Principal ID and Secret are stored directly in the metadata database with the secret value being encrypted, then decrypted at runtime.
- StoreInKeyVault = this value supports the new optional behaviour for SPN details. Instead of storing the values directly in the database, it just stores the Key Vault secret URLs where the actual values can be found/addressed.
In either case this property drives which attributes are inserted into the table [dbo].[ServicePrincipals] as shown below.
Currently, you can use one option or the other, and for each it is expected that both values will be in the database or in Key Vault. It’s not possible to store just the Principal Id in the database but the Secret in Key Vault. I’ll add this hybrid setup as a low priority backlog item in case anyone does want this flexibility.
To support the Key Vault handling behaviour, two new attributes have been added to the [dbo].[ServicePrincipals] table.
This pair of attributes is required separately as data types prevent the existing PrincipalId and PrincipalSecret attributes being reused.
Below is how the content of the table could look when the property SPNHandlingMethod is set to each value.
Various integrity checks, at both runtime and metadata entry time, have been set up to ensure the values entered are in the correct format and to catch combinations of attributes that have been incorrectly entered.
A new ‘advisory’ scalar function, [procfwk].[CheckForValidURL], has been added to the database to parse the provided Key Vault secret URLs. This validates the URL string with conditions specific to a Key Vault secret URL. It is not a generic URL validator. It ensures the expected sub-domain is present and attempts to check if a GUID (secret version) has been included in the URL.
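To make the idea of an advisory check concrete, here is a rough Python sketch of the same kind of validation. It is not a port of [procfwk].[CheckForValidURL]: the function name, the warning wording and the exact rules are assumptions. It also assumes the public Azure cloud host suffix `.vault.azure.net` and that a secret version appears as a third path segment made up of 32 hex characters (optionally dashed like a GUID).

```python
import re
from typing import List
from urllib.parse import urlsplit

# A secret version looks like a GUID, with or without dashes (assumption for this sketch).
GUID_LIKE = re.compile(
    r"^(?:[0-9a-fA-F]{32}|[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})$"
)


def check_key_vault_secret_url(url: str) -> List[str]:
    """Return advisory warnings for a Key Vault secret URL; an empty list means no concerns."""
    warnings = []
    parts = urlsplit(url)

    if parts.scheme != "https":
        warnings.append("URL should use https.")
    if not (parts.hostname or "").endswith(".vault.azure.net"):
        warnings.append("Expected a '.vault.azure.net' host.")

    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2 or segments[0] != "secrets":
        warnings.append("Expected a '/secrets/<name>' path.")
    elif len(segments) > 2 and GUID_LIKE.match(segments[2]):
        warnings.append("URL appears to include a secret version; provide it without one.")

    return warnings


if __name__ == "__main__":
    print(check_key_vault_secret_url("https://myvault.vault.azure.net/secrets/SpnId"))
```

Returning warnings rather than raising keeps the check soft, which matches the advisory intent: the caller decides whether to print, log or block.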
[procfwk].[AddServicePrincipalWrapper] – this procedure has been added to sit above the behaviour specific SPN insert procedures. It inspects the new property value and then passes parameters to the respective child procedures. At this level the parameters for the SPN principal and secret can be considered for either behaviour, so they carry generic names (such as @PrincipalSecretValue) with data types of NVARCHAR(MAX).
[procfwk].[AddServicePrincipal] – this procedure remains unchanged. However, going forward it should no longer be called directly, only by the above wrapper procedure.
[procfwk].[AddServicePrincipalUrls] – this procedure has been added to handle the new Key Vault secret URLs and insert them into the new attributes of the table [dbo].[ServicePrincipals]. In addition, the passed URLs will be validated using the above scalar function. This is only a soft validation, mainly to guard against incorrect Key Vault URLs, or URLs that include a secret version GUID, being provided. View the printed outputs from the stored procedure should the validation fail and the warning be needed.
[procfwk].[DeleteServicePrincipal] – this procedure has been updated to now use the credential ID value to perform deletions from the framework metadata tables. The procedure is now value/behaviour agnostic and uses the new property value only to establish what defensive checks to perform internally before deleting pipeline and credential links.
In all cases, these procedures are intended to support the control of the framework metadata at deployment time and ensure data integrity. Record insertions could be done directly, but wouldn’t be supported by the framework.
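The wrapper pattern described above, one entry point that inspects the property and hands off to a behaviour specific child, can be sketched as follows. This is an illustration in Python rather than the T-SQL procedures themselves. The in-memory `table` list, the `principal_id_value` parameter name and the `principal_id_url` / `principal_secret_url` keys are hypothetical; encryption of the secret is omitted.

```python
from typing import Dict, List


def add_service_principal(table: List[Dict], principal_id: str, principal_secret: str) -> None:
    """Existing behaviour: values stored directly (encryption left out of this sketch)."""
    table.append({"PrincipalId": principal_id, "PrincipalSecret": principal_secret})


def add_service_principal_urls(table: List[Dict], id_url: str, secret_url: str) -> None:
    """New behaviour: only the Key Vault secret URLs are stored (hypothetical key names)."""
    table.append({"principal_id_url": id_url, "principal_secret_url": secret_url})


def add_service_principal_wrapper(
    table: List[Dict],
    handling_method: str,
    principal_id_value: str,
    principal_secret_value: str,
) -> None:
    """Inspect the handling method and pass the generic values to the matching child."""
    if handling_method == "StoreInDatabase":
        add_service_principal(table, principal_id_value, principal_secret_value)
    elif handling_method == "StoreInKeyVault":
        add_service_principal_urls(table, principal_id_value, principal_secret_value)
    else:
        raise ValueError(f"Unknown SPNHandlingMethod: {handling_method!r}")


if __name__ == "__main__":
    rows: List[Dict] = []
    add_service_principal_wrapper(
        rows,
        "StoreInKeyVault",
        "https://myvault.vault.azure.net/secrets/SpnId",
        "https://myvault.vault.azure.net/secrets/SpnSecret",
    )
    print(rows)
```

Keeping the wrapper's parameters generic is what lets callers stay the same whichever handling method is configured.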
[procfwk].[CheckMetadataIntegrity] – new runtime checks have been added to this procedure relating to the new property value, as follows:
- Does the SPNHandlingMethod property have a valid value?
- Does the Service Principal table contain both types of SPN handling for a single credential?
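The two checks listed above can be expressed as a small Python sketch. It is not the logic inside [procfwk].[CheckMetadataIntegrity]; the row shape is assumed, with `CredentialId`, `principal_id_url` and `principal_secret_url` being hypothetical key names used only for illustration.

```python
from typing import Dict, List, Optional

VALID_METHODS = {"StoreInDatabase", "StoreInKeyVault"}


def check_spn_integrity(
    handling_method: str, service_principals: List[Dict[str, Optional[str]]]
) -> List[str]:
    """Return a list of integrity issues; an empty list means both checks passed."""
    issues = []

    # Check 1: does the property hold one of the expected values?
    if handling_method not in VALID_METHODS:
        issues.append(f"SPNHandlingMethod has an invalid value: {handling_method!r}")

    # Check 2: does any single credential carry both types of SPN handling?
    for row in service_principals:
        has_direct = any(row.get(k) for k in ("PrincipalId", "PrincipalSecret"))
        has_urls = any(row.get(k) for k in ("principal_id_url", "principal_secret_url"))
        if has_direct and has_urls:
            issues.append(f"Credential {row.get('CredentialId')} mixes both handling types.")

    return issues


if __name__ == "__main__":
    rows = [
        {"CredentialId": "1", "PrincipalId": "app-id", "PrincipalSecret": "secret"},
        {"CredentialId": "2", "PrincipalId": "app-id", "principal_secret_url": "https://..."},
    ]
    print(check_spn_integrity("StoreInDatabase", rows))
```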
To take advantage of this new feature of storing Key Vault URLs in the metadata database, you firstly need to enable your Function App’s MSI via the Azure Portal or via your deployment processes.
Screen shot of doing this manually via the Azure Portal blades:
Once the Function App MSI is enabled, add it to your Azure Key Vault access policies so the Functions can independently authenticate themselves and retrieve values (secrets in the case of the framework) from Key Vault.
In the context of the processing framework this new behaviour means the values for:
- Application Id
- Authentication Key
can now be passed in the request body as Key Vault secret URLs for the Functions:
Internally the Functions establish if a URL has been provided then query Azure Key Vault using the Microsoft DefaultAzureCredential() class from the Azure.Identity libraries to authenticate.
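As a rough illustration of that branch, the sketch below uses the Python Azure SDK equivalents (`azure-identity` and `azure-keyvault-secrets`) rather than the framework's own Function code. How the Functions actually decide that a value is a URL is not stated here, so the `https://` prefix test is an assumption, as are the function names.

```python
from typing import Callable
from urllib.parse import urlsplit


def fetch_secret_from_key_vault(secret_url: str) -> str:
    """Read a secret using whatever identity the host provides (for example an MSI)."""
    # Imported lazily so the sketch loads even where the Azure SDK is not installed.
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    parts = urlsplit(secret_url)
    vault_url = f"{parts.scheme}://{parts.netloc}"
    # Path is expected to look like /secrets/<name>, so the name is the second segment.
    secret_name = parts.path.strip("/").split("/")[1]

    client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())
    return client.get_secret(secret_name).value


def resolve_spn_value(
    value: str, fetch: Callable[[str], str] = fetch_secret_from_key_vault
) -> str:
    """Return the value unchanged unless it looks like a URL, in which case resolve it."""
    if value.lower().startswith("https://"):
        return fetch(value)
    return value


if __name__ == "__main__":
    # A plain value passes straight through without touching Key Vault.
    print(resolve_spn_value("00000000-0000-0000-0000-000000000000"))
```

Passing the fetcher in as a parameter keeps the branch easy to exercise locally with a stub, without needing a real vault or identity.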
For reusability across the three Functions this handling is wrapped up in a couple of helper classes, including a Key Vault client. These helpers are then used within each Function as follows:
The Key Vault URL resource is determined from the complete secret URL provided. This means that potentially pipeline authentication could be passed off to separate Key Vault resources if required.
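A small Python sketch shows why deriving the vault from the secret URL allows different vaults per pipeline. It relies on the general shape of Key Vault secret identifiers (`https://<vault>.vault.azure.net/secrets/<name>` with an optional trailing version); the `SecretAddress` type, the function name and the vault names are hypothetical.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class SecretAddress(NamedTuple):
    vault_url: str
    secret_name: str
    version: Optional[str]


def split_secret_url(secret_url: str) -> SecretAddress:
    """Split a complete secret URL into the vault URL, the secret name and any version."""
    parts = urlsplit(secret_url)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2 or segments[0] != "secrets":
        raise ValueError(f"Not a Key Vault secret URL: {secret_url!r}")
    version = segments[2] if len(segments) > 2 else None
    return SecretAddress(f"{parts.scheme}://{parts.netloc}", segments[1], version)


if __name__ == "__main__":
    # Two pipelines whose credentials live in two different (made up) vaults.
    for url in (
        "https://finance-kv.vault.azure.net/secrets/SpnId",
        "https://sales-kv.vault.azure.net/secrets/SpnId",
    ):
        print(split_secret_url(url))
```

Because the vault address travels with each URL, nothing needs to be configured globally to point different credentials at different vaults.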
Most importantly, it is expected that the Key Vault secret URL will be provided (inserted into the database) in the following format and excluding the secret version GUID.
The Functions aren’t aware of the database property (SPNHandlingMethod) influencing behaviour and rely on the metadata values being correctly entered, hence the metadata being validated in several places. This informed assumption saves an additional step in the Function first querying the database.
That concludes the release notes for this version of ADF.procfwk.
Please reach out if you have any questions or want help updating your implementation from the previous release.