A tool I’ve used in the past to analyse my local disk drives is WinDirStat. It’s great for finding out where large files are hiding in your directories using a heat map visual. In this post I’ll walk you through how I’ve created a similar tool for Azure Data Lake Store using PowerShell, SQL Server and Power BI.
Overview
For those of you new to any of this tech here’s a quick overview of how I achieved this, in picture form 🙂
- PowerShell queries your Azure Subscription and Data Lake Storage accounts writing the results to a SQL Server database table.
- A couple of views aggregate the data for ease.
- Power BI visualises the results and allows quick filtering at various levels.
Before we dive in, please be aware I’ve put everything (all the code) I used to do this in a GitHub repository here: https://github.com/mrpaulandrew/AzureDataLakeStoreDIRStats
PowerShell
Out of the box, the Azure PowerShell module (AzureRM) includes cmdlets to query Azure Data Lake Store files and folders. Namely:
- Get-AzureRmDataLakeStoreChildItem
- Get-AzureRmDataLakeStoreItem
More cmdlets here.
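As a minimal sketch of what these cmdlets give you (the subscription name, account name and file path below are placeholders, not values from my environment):

```powershell
# Sign in and select the subscription that owns the Data Lake Store (names are placeholders).
Login-AzureRmAccount
Select-AzureRmSubscription -SubscriptionName "MySubscription"

# List the immediate children of the root folder.
Get-AzureRmDataLakeStoreChildItem -Account "mydatalakestore" -Path "/"

# Inspect a single item to see its size (Length) and type (FILE or DIRECTORY).
Get-AzureRmDataLakeStoreItem -Account "mydatalakestore" -Path "/RawData/SomeFile.csv"
```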
However, they don’t currently support a ‘recurse’ switch like you’d get with the main PowerShell Get-ChildItem cmdlet. Therefore this behaviour has to be engineered using a series of PowerShell functions. For this part of the script credit goes to Microsoft’s Alexandra Gattiker for this answer on Stack Overflow, many thanks.
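The functions in the repo follow that Stack Overflow answer; purely to illustrate the idea (this is a rough sketch, not the exact code from the repo or the answer), recursion can be layered on top of Get-AzureRmDataLakeStoreChildItem along these lines:

```powershell
# Rough sketch only - the repo uses the functions from the Stack Overflow answer.
function Get-AdlsChildItemRecursive
{
    param
    (
        [string] $Account,
        [string] $Path = "/"
    )

    foreach ($item in Get-AzureRmDataLakeStoreChildItem -Account $Account -Path $Path)
    {
        # Build the full path of the current item from the parent path and its name.
        $itemPath = ($Path.TrimEnd('/') + '/' + $item.Name)

        # Return the item with its full path attached, then recurse into directories.
        $item | Add-Member -NotePropertyName FullPath -NotePropertyValue $itemPath -PassThru

        if ($item.Type -eq 'DIRECTORY')
        {
            Get-AdlsChildItemRecursive -Account $Account -Path $itemPath
        }
    }
}

# Example call against a placeholder account name.
Get-AdlsChildItemRecursive -Account "mydatalakestore" -Path "/"
```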
Armed with these functions, I wrapped the results with my Azure login and inserted the records into a SQL Server table using a stored procedure call.
Note: for the path parameter (line 34) I’ve left this at the root level. Feel free to query sub folders if preferred.
The PowerShell script does take a while to hit all levels of your Data Lake Store via the API, so please be patient if you have a lot of files in your account.
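As a hedged sketch of that insert loop, using the recursive function sketched above (the server, database, procedure and parameter names here are my placeholders, not necessarily those used in the repo):

```powershell
# Placeholder connection details - swap in your own server, database and credentials.
$sqlServer   = "myserver.database.windows.net"
$sqlDatabase = "DataLakeStats"

foreach ($file in Get-AdlsChildItemRecursive -Account "mydatalakestore" -Path "/")
{
    # Call a stored procedure for each item found (procedure and parameter names are assumptions).
    $query = "EXEC [dbo].[InsertFileDetail] " +
             "@AccountName = N'mydatalakestore', " +
             "@FullPath = N'$($file.FullPath)', " +
             "@ItemType = N'$($file.Type)', " +
             "@SizeInBytes = $($file.Length);"

    Invoke-Sqlcmd -ServerInstance $sqlServer -Database $sqlDatabase `
                  -Username "sqluser" -Password "placeholder" -Query $query
}
```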
SQL Server
For the SQL Server part I used a very simple stored procedure that PowerShell could call to insert each file record found. The destination table is also very basic, containing just 6 fields, seen on the right.
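The real definitions live in the GitHub repo; as a sketch of the shape (the table, procedure and column names below are my guesses, deployed here from PowerShell via Invoke-Sqlcmd):

```powershell
# A guess at the shape of the destination table and stored procedure (real DDL is in the repo).
$ddl = @"
CREATE TABLE dbo.DataLakeStoreItem
(
    Id           INT IDENTITY(1,1) PRIMARY KEY,
    AccountName  NVARCHAR(100)  NOT NULL,
    FullPath     NVARCHAR(4000) NOT NULL,
    ItemType     NVARCHAR(20)   NOT NULL,
    SizeInBytes  BIGINT         NOT NULL,
    CapturedDate DATETIME       NOT NULL DEFAULT (GETDATE())
);
GO
CREATE PROCEDURE dbo.InsertFileDetail
    @AccountName NVARCHAR(100),
    @FullPath    NVARCHAR(4000),
    @ItemType    NVARCHAR(20),
    @SizeInBytes BIGINT
AS
BEGIN
    INSERT INTO dbo.DataLakeStoreItem (AccountName, FullPath, ItemType, SizeInBytes)
    VALUES (@AccountName, @FullPath, @ItemType, @SizeInBytes);
END
"@

Invoke-Sqlcmd -ServerInstance "myserver.database.windows.net" -Database "DataLakeStats" -Query $ddl
```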
Once the raw data was inserted, I created a couple of views to aggregate the data by parsing the full file path to get the root level folder names. I was tempted to break out the STRING_SPLIT function here, but didn’t want to make this specific to SQL Server 2016 onwards. Therefore I used our old friends SUBSTRING and CHARINDEX.
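To give a feel for the approach (this is a sketch only; the view and column names are my assumptions rather than the definitions in the repo), one of the views could look something like this:

```powershell
# Sketch of a view that parses the root level folder name out of the full path
# using SUBSTRING and CHARINDEX (view and column names are assumptions).
$viewDdl = @"
CREATE VIEW dbo.DataLakeStoreRootFolderSize
AS
SELECT
    AccountName,
    -- Take everything between the first and second '/' as the root folder name.
    SUBSTRING(FullPath, 2, CHARINDEX('/', FullPath + '/', 2) - 2) AS RootFolder,
    SUM(SizeInBytes) AS TotalSizeInBytes,
    COUNT(*)         AS ItemCount
FROM dbo.DataLakeStoreItem
WHERE ItemType = 'FILE'
GROUP BY
    AccountName,
    SUBSTRING(FullPath, 2, CHARINDEX('/', FullPath + '/', 2) - 2);
"@

Invoke-Sqlcmd -ServerInstance "myserver.database.windows.net" -Database "DataLakeStats" -Query $viewDdl
```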
I would welcome any changes or enhancements people want to make to the SQL querying part. I kept it simple for the purpose of this post.
Power BI
To support a single dashboard slicer, my Power BI model uses the table and views from the SQL instance, joined on the storage Account Name. Relationships are shown on the right.
The dashboard is then simply a series of standard visuals, with my favourite being the file heat map. I confess I filtered the one below to exclude a couple of large files, making it worthy of a screenshot 🙂
Data Lake Store
As a side note, to reduce noise in your storage dashboard: if you run a lot of U-SQL jobs against your storage using Azure Data Lake Analytics, you may find the ‘system’ directory contains a lot of job profile data that could be removed. I recommend tweaking the service properties via the Azure portal, specifically the number of days U-SQL job data is retained. Hope this helps.
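The portal is the easiest route for this. If you prefer to script it, I believe the AzureRM.DataLakeAnalytics module exposes the same setting as the QueryStoreRetention parameter on Set-AzureRmDataLakeAnalyticsAccount, though it’s worth confirming with Get-Help before relying on it:

```powershell
# Assumption: the portal's job data retention setting maps to QueryStoreRetention (days)
# on Set-AzureRmDataLakeAnalyticsAccount - confirm with Get-Help before relying on this.
Set-AzureRmDataLakeAnalyticsAccount `
    -Name "mydatalakeanalytics" `
    -ResourceGroupName "my-resource-group" `
    -QueryStoreRetention 7
```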
Many thanks for reading
Hi Paul,
Very useful entry. I tried it outputting the data to a plain text file and, after 9 hours running the script, it gave me a 1 GB file for a Data Lake Store with a total size of 80 TB.
I would like to automate this process, sending the output to an Azure SQL Database like in your original code.
Which method do you think is most appropriate for this purpose?
I’m considering an ADF Custom Activity, an Azure Function or Azure Automation, but I’m not sure which would be the best option.
Thanks