Here’s a quick bit of information I thought was worth sharing…
For file types that don’t contain their own metadata (CSV, text, etc.) we typically have to go and figure out their structure, including attributes and data types, before doing any actual transformation work. Often I’ve used the Data Factory Metadata Activity to do this with its structure option. However, while playing around with Azure Synapse Analytics, specifically creating Notebooks in C# to run against the Apache Spark compute pools, I’ve discovered that in most cases the Data Frame infer schema option basically does a better job here.
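For context, here’s a minimal sketch of what that looks like in a Synapse C# (.NET for Apache Spark) notebook. The storage path is a placeholder for wherever your CSV happens to land.

```csharp
using Microsoft.Spark.Sql;

// In a Synapse notebook the SparkSession is normally provided for you;
// building it explicitly here just so the sketch is self-contained.
SparkSession spark = SparkSession.Builder().GetOrCreate();

// Read the CSV with header detection and schema inference enabled.
// Spark samples the data to work out each column's data type,
// rather than defaulting everything to string.
DataFrame df = spark.Read()
    .Option("header", true)
    .Option("inferSchema", true)
    .Csv("abfss://data@yourstorage.dfs.core.windows.net/SalesOrderDetail.csv"); // placeholder path

// Show what Spark inferred.
df.PrintSchema();
```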
Now, I’m sure some Spark people will read the above and think, well der, obviously Paul! Spark is better than Data Factory. And sure, I accept that for this specific situation it certainly is. I’m simply calling it out as it might not be obvious to everyone 😉
A quick example from my playing around:
The actual dataset as seen in Notepad++
The metadata structure from Data Factory
The inferred schema from the Spark data frame
A side by side comparison
| Column Name | ADF Data Type | Spark Data Type |
| --- | --- | --- |
| SalesOrderID | string | integer |
| SalesOrderDetailID | string | integer |
| OrderQty | string | integer |
| ProductID | string | integer |
| UnitPrice | string | double |
| UnitPriceDiscount | string | double |
| LineTotal | string | double |
| rowguid | string | string |
| ModifiedDate | string | string |
This was only a small dataset, with just 542 rows of data, but I did the same thing with several others before drawing this conclusion.
To that end, I’m suggesting the following:
- For quick schema inference without too much effort, the ADF Metadata Activity can help during a control flow operation.
- For more accurate schema inference, use a dedicated transformation tool.
- For exact schema definitions, create the schema yourself (see the sketch below) or, ideally, inherit it from a metadata backed source system.
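As a rough illustration of the “create it yourself” option, this is a minimal sketch of supplying an explicit schema in a Synapse C# notebook, reusing the column types from the comparison table above. The path is again a placeholder.

```csharp
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

SparkSession spark = SparkSession.Builder().GetOrCreate();

// Define the schema explicitly rather than paying for an inference pass.
// Types taken from the side by side comparison table above.
var schema = new StructType(new[]
{
    new StructField("SalesOrderID", new IntegerType()),
    new StructField("SalesOrderDetailID", new IntegerType()),
    new StructField("OrderQty", new IntegerType()),
    new StructField("ProductID", new IntegerType()),
    new StructField("UnitPrice", new DoubleType()),
    new StructField("UnitPriceDiscount", new DoubleType()),
    new StructField("LineTotal", new DoubleType()),
    new StructField("rowguid", new StringType()),
    new StructField("ModifiedDate", new StringType())
});

DataFrame df = spark.Read()
    .Option("header", true)
    .Schema(schema) // no sampling job needed; the schema is fixed up front
    .Csv("abfss://data@yourstorage.dfs.core.windows.net/SalesOrderDetail.csv"); // placeholder path
```

A nice side effect of the explicit schema is that Spark doesn’t need the extra pass over the data that infer schema triggers, which also comes up in the comments below.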
Side note: if you go to the Dataset within the Data Factory UI and use Import Schema from the source connection, you’ll also get the same result as the Metadata Activity, seen below.
Many thanks for reading
Nice and cool info!
Nice info. Databricks recommends not using inferSchema, as it triggers a separate job to work out the schema definition from a sample of the dataset.
So it is always better to inherit it from a metadata backed source system, as mentioned above…
Yes, good point. Really this was about comparing the accuracy against the ADF Metadata Activity, though.