A tool I’ve used in the past to analyse my local disk drives is WinDirStat. It’s great for finding out where large files are hiding in your directories using a heat map visual. In this post I’ll walk you through how I’ve created a similar tool for Azure Data Lake Store using PowerShell, SQL Server and Power BI.
Overview
For those of you new to any of this tech here’s a quick overview of how I achieved this, in picture form 🙂
- PowerShell queries your Azure Subscription and Data Lake Storage accounts writing the results to a SQL Server database table.
- A couple of views aggregate the data for ease.
- Power BI visualises the results and allows quick filtering at various levels.
Before we dive in, please be aware I’ve put everything (all the code) I used to do this in a GitHub repository here: https://github.com/mrpaulandrew/AzureDataLakeStoreDIRStats
PowerShell
Out of the box, the Azure PowerShell module (AzureRM) includes cmdlets to query Azure Data Lake Store files and folders. Namely:
- Get-AzureRmDataLakeStoreChildItem
- Get-AzureRmDataLakeStoreItem
More cmdlets here.
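As a minimal sketch of what these cmdlets give you (the subscription name, account name and file path below are placeholders, not values from my environment):

```powershell
# Sign in and select the subscription that owns the Data Lake Store (names are placeholders).
Login-AzureRmAccount
Select-AzureRmSubscription -SubscriptionName "MySubscription"

# List the immediate children of the root folder.
Get-AzureRmDataLakeStoreChildItem -Account "mydatalakestore" -Path "/"

# Inspect a single item to see its size (Length) and type (FILE or DIRECTORY).
Get-AzureRmDataLakeStoreItem -Account "mydatalakestore" -Path "/RawData/SomeFile.csv"
```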
However, they don’t currently support a ‘recurse’ switch like you’d get with the main PowerShell Get-ChildItem cmdlet. Therefore this behaviour has to be engineered using a series of PowerShell functions. For this part of the script credit goes to Microsoft’s Alexandra Gattiker for this answer on Stack Overflow, many thanks.
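The functions in the repo follow that Stack Overflow answer; purely to illustrate the idea (this is a rough sketch, not the exact code from the repo or the answer), recursion can be layered on top of Get-AzureRmDataLakeStoreChildItem along these lines:

```powershell
# Rough sketch only - the repo uses the functions from the Stack Overflow answer.
function Get-AdlsChildItemRecursive
{
    param
    (
        [string] $Account,
        [string] $Path = "/"
    )

    foreach ($item in Get-AzureRmDataLakeStoreChildItem -Account $Account -Path $Path)
    {
        # Build the full path of the current item from the parent path and its name.
        $itemPath = ($Path.TrimEnd('/') + '/' + $item.Name)

        # Return the item with its full path attached, then recurse into directories.
        $item | Add-Member -NotePropertyName FullPath -NotePropertyValue $itemPath -PassThru

        if ($item.Type -eq 'DIRECTORY')
        {
            Get-AdlsChildItemRecursive -Account $Account -Path $itemPath
        }
    }
}

# Example call against a placeholder account name.
Get-AdlsChildItemRecursive -Account "mydatalakestore" -Path "/"
```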
Armed with these functions, I wrapped the results with my Azure login and inserted the records into a SQL Server table using a stored procedure call.
Note: for the path parameter (line 34) I’ve left this at the root level. Feel free to query sub folders if preferred.
The PowerShell script does take a while to hit all levels of your Data Lake Store via the API, so please be patient if you have a lot of files in your account.
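As a hedged sketch of that insert loop, using the recursive function sketched above (the server, database, procedure and parameter names here are my placeholders, not necessarily those used in the repo):

```powershell
# Placeholder connection details - swap in your own server, database and credentials.
$sqlServer   = "myserver.database.windows.net"
$sqlDatabase = "DataLakeStats"

foreach ($file in Get-AdlsChildItemRecursive -Account "mydatalakestore" -Path "/")
{
    # Call a stored procedure for each item found (procedure and parameter names are assumptions).
    $query = "EXEC [dbo].[InsertFileDetail] " +
             "@AccountName = N'mydatalakestore', " +
             "@FullPath = N'$($file.FullPath)', " +
             "@ItemType = N'$($file.Type)', " +
             "@SizeInBytes = $($file.Length);"

    Invoke-Sqlcmd -ServerInstance $sqlServer -Database $sqlDatabase `
                  -Username "sqluser" -Password "placeholder" -Query $query
}
```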
SQL Server
For the SQL Server part I used a very simple stored procedure that PowerShell could call to insert each file record found. The destination table is also very basic, containing just 6 fields, seen on the right.
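The real definitions live in the GitHub repo; as a sketch of the shape (the table, procedure and column names below are my guesses, deployed here from PowerShell via Invoke-Sqlcmd):

```powershell
# A guess at the shape of the destination table and stored procedure (real DDL is in the repo).
$ddl = @"
CREATE TABLE dbo.DataLakeStoreItem
(
    Id           INT IDENTITY(1,1) PRIMARY KEY,
    AccountName  NVARCHAR(100)  NOT NULL,
    FullPath     NVARCHAR(4000) NOT NULL,
    ItemType     NVARCHAR(20)   NOT NULL,
    SizeInBytes  BIGINT         NOT NULL,
    CapturedDate DATETIME       NOT NULL DEFAULT (GETDATE())
);
GO
CREATE PROCEDURE dbo.InsertFileDetail
    @AccountName NVARCHAR(100),
    @FullPath    NVARCHAR(4000),
    @ItemType    NVARCHAR(20),
    @SizeInBytes BIGINT
AS
BEGIN
    INSERT INTO dbo.DataLakeStoreItem (AccountName, FullPath, ItemType, SizeInBytes)
    VALUES (@AccountName, @FullPath, @ItemType, @SizeInBytes);
END
"@

Invoke-Sqlcmd -ServerInstance "myserver.database.windows.net" -Database "DataLakeStats" -Query $ddl
```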
Once the raw data was inserted, I created a couple of views to aggregate the data by parsing the full file path to get the root level folder names. I was tempted to break out the STRING_SPLIT function here, but didn’t want to make this specific to SQL Server 2016 onwards. Therefore I used our old friends SUBSTRING and CHARINDEX.
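To give a feel for the approach (this is a sketch only; the view and column names are my assumptions rather than the definitions in the repo), one of the views could look something like this:

```powershell
# Sketch of a view that parses the root level folder name out of the full path
# using SUBSTRING and CHARINDEX (view and column names are assumptions).
$viewDdl = @"
CREATE VIEW dbo.DataLakeStoreRootFolderSize
AS
SELECT
    AccountName,
    -- Take everything between the first and second '/' as the root folder name.
    SUBSTRING(FullPath, 2, CHARINDEX('/', FullPath + '/', 2) - 2) AS RootFolder,
    SUM(SizeInBytes) AS TotalSizeInBytes,
    COUNT(*)         AS ItemCount
FROM dbo.DataLakeStoreItem
WHERE ItemType = 'FILE'
GROUP BY
    AccountName,
    SUBSTRING(FullPath, 2, CHARINDEX('/', FullPath + '/', 2) - 2);
"@

Invoke-Sqlcmd -ServerInstance "myserver.database.windows.net" -Database "DataLakeStats" -Query $viewDdl
```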
I would welcome any changes or enhancements people want to make to the SQL querying part. I kept it simple for the purpose of this post.
Power BI
To support a single dashboard slicer, my Power BI model uses the table and views from the SQL instance, joined on the storage Account Name. Relationships are shown on the right.
The dashboard is then simply a series of standard visuals, with my favourite being the file heat map. I confess I filtered the one below to exclude a couple of large files, making it worthy of a screenshot 🙂
Data Lake Store
As a side note, to reduce noise in your storage dashboard: if you run a lot of U-SQL jobs against your storage using Azure Data Lake Analytics, you may find the ‘system’ directory contains a lot of job profile data that could be removed. I recommend tweaking the service properties via the Azure portal, specifically the number of days U-SQL job data is retained. Hope this helps.
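The portal is the easiest route for this. If you prefer to script it, I believe the AzureRM.DataLakeAnalytics module exposes the same setting as the QueryStoreRetention parameter on Set-AzureRmDataLakeAnalyticsAccount, though it’s worth confirming with Get-Help before relying on it:

```powershell
# Assumption: the portal's job data retention setting maps to QueryStoreRetention (days)
# on Set-AzureRmDataLakeAnalyticsAccount - confirm with Get-Help before relying on this.
Set-AzureRmDataLakeAnalyticsAccount `
    -Name "mydatalakeanalytics" `
    -ResourceGroupName "my-resource-group" `
    -QueryStoreRetention 7
```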
Many thanks for reading
Hi Paul,
Very useful entry. I tried it outputting the data to a plain text file and, after 9 hours running the script, it gave me a 1 GB file for a Data Lake Store with a total size of 80 TB.
I would like to automate this process, sending the output to an Azure SQL Database like in your original code.
Which method do you think is most appropriate for this purpose?
I’m considering an ADF Custom Activity, an Azure Function or Azure Automation, but I’m not sure which would be the best option.
Thanks