Happy New Year folks and welcome back to work! I hope you had a nice break. 🙂
I decided to kick off the new year of blogging with something fairly simple following some support work I did helping out a friend in the community. The problem encountered was twofold when dealing with data in Azure Data Lake Analytics and performing transformations using U-SQL.
- The dataset used custom field delimiters beyond the ‘out of box’ offerings of tabs and comma’s in the U-SQL extractors.
- The columns in the dataset varied from row to row with different rows containing different data values.
To combat these issues and for the purpose of this blog I’ve created this very simple text file seen below. Here we have 3 records in the file (Person 1, 2 and 3), but a single record is spread across multiple rows with a varying amount of fields on each line.
In both cases we can overcome these issues with the wonderful U-SQL string function split… Something that is now also available in T-SQL. Woo!
Using split we can define any delimiting character, pipes in my sample data. Plus, we can use the length of the array returned in the split to handle the variable rows. How? See below.
First, in the U-SQL extractor we need to ignore anything in the rows and simply ingest each complete row as a single string (128KB limit). To do this we can use the ‘out of box’ text extractor making sure we set the qualifier switch to false.
Next, let’s use split with a WHERE clause looking at the lengths of the arrays to deal with the varying amounts of fields. We can see in the file name data has 3 fields, address data has 5 fields and other data has 6.
Here you can see in the U-SQL I have put each subset of data in separate “table variables”.
Finally, we can join the record subsets back together to form a standard width, more conventional CSV output file. Or in production do something else with the staged data.
The proof below. 3 records in a clean CSV file with field headers.
If you’d like a copy of the complete U-SQL script please feel free to download it here.
I hope you found the above example useful. Custom delimiters and varying fields per row handled.
Many thanks for reading.