
Data Factory vs. Databricks: When your love for data is missing some (Apache) Spark

How do you choose the data partner that suits you best? Azure Data Factory has great benefits when you start out, but Apache Spark allows you to explore deeper layers. So is Azure Databricks your data partner that brings the "Spark" back into your data project? One thing is for sure, in Azure you don't necessarily have to choose a monogamous data solution.

Gregor Suttie, Azure Architect & MVP

Picture this. Your boss has asked you to load some data sources into Azure for further processing. Like any wise programmer, you start your quest for information on the internet. It’s not long before you stumble upon Azure Data Factory, about which you find the following information:

Azure Data Factory

Azure Data Factory (ADF) is a fully managed, serverless data integration service. It can be used for ingesting and transforming data sources by using Pipelines. Out of the box, you can connect to more than 90 built-in data sources through a Graphical User Interface (GUI), so configuring a connection to your data sources by hand takes little effort. There is also an option to write custom-coded connectors.


The honeymoon phase

ADF is like love at first sight. You create connections to your datasets by hand using the GUI, and data sources can be loaded by dragging the necessary tasks into your ADF Pipeline. Since ADF is a drag-and-drop tool, there is no need to learn yet another programming language, so you expect to be up to speed in no time. That assumption proves correct: after a few days of configuring and careful tweaking, your data solution is working like a charm. Kicking off your data journey with ADF is starting to look like a match made in the (Azure) cloud!


When the Spark is just no longer there…

As time goes on, your boss asks you to onboard more data sources, some of which are not natively supported by ADF. You spend a ton of time configuring and setting up custom connections to them. As if that weren’t bad enough, you are also asked to transform and clean data before saving it to storage, and all of these transformations must be configured by hand. Maintaining ADF is becoming quite cumbersome.

You start wondering… is your marriage to ADF so blissful after all? You decide to continue your quest for a more suitable data partner. It’s not long before you find Apache Spark and Azure Databricks:

Apache Spark

Apache Spark is a multi-language engine used for data engineering, data science, and machine learning use cases. For example, Apache Spark can be used for aggregating, transforming, or cleaning large amounts of data.   
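To make that concrete, here is a minimal PySpark sketch of the read-clean-aggregate pattern. The file path and the amount and country columns are made up for illustration, not taken from any real pipeline:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-demo").getOrCreate()

# Read a hypothetical CSV file of orders (path and columns are assumptions).
orders = spark.read.csv("/data/orders.csv", header=True, inferSchema=True)

# Clean: drop rows missing an amount, and normalize country names.
cleaned = (orders
           .dropna(subset=["amount"])
           .withColumn("country", F.upper(F.col("country"))))

# Aggregate: total revenue per country.
totals = cleaned.groupBy("country").agg(F.sum("amount").alias("revenue"))
totals.show()
```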

Azure Databricks

Azure Databricks is one of the tools that let you run Apache Spark code on Azure. It helps you gain insights into your data while running the latest version of Apache Spark: set up your environment within minutes, enjoy autoscaling, and collaborate with others by using notebooks. In Azure Databricks, you can aggregate, transform, and clean data with Apache Spark code, but it offers other functionality as well, such as analyzing datasets, creating visualizations, and tackling data science use cases. Databricks supports multiple languages, including Python, Scala, R, Java, and SQL.
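As a taste of that notebook experience, here is a small sketch of mixing Python and Spark SQL in a Databricks notebook. The data is made up, and `spark` and `display` are predefined by the notebook environment:

```python
# A small in-memory DataFrame stands in for real data (values are invented).
totals = spark.createDataFrame(
    [("NL", 120.0), ("DE", 340.0), ("FR", 95.0)],
    ["country", "revenue"],
)

# Register it as a temporary view so it can also be queried with Spark SQL.
totals.createOrReplaceTempView("totals")
top = spark.sql("SELECT country, revenue FROM totals ORDER BY revenue DESC")

# display() renders the result as an interactive table or chart in the notebook.
display(top)
```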

Note: another often-used option for running Apache Spark code on Azure is Azure Synapse Analytics. The functionality of Azure Synapse Analytics and Azure Databricks overlaps significantly, but the differences between the two tools are out of scope for this article.


How to get it (the Spark) back

Looking at ADF and Azure Databricks, you find some similarities and differences. Both can be used for data ingestion and transformation, but they go about it differently: in ADF you create pipelines in the GUI, whereas in Azure Databricks you write code. It is the coding part that makes the difference. Even though it will take some time to get acquainted with Apache Spark, using code enables you to build data pipelines in a more generic manner, making it easy to reuse existing logic when new datasets are onboarded. Will Azure Databricks be the key to getting the Spark back into your data project?
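To illustrate that reusability, here is a sketch of what a generic ingestion step could look like. The function name, paths, and column names are assumptions for the example, not a prescribed pattern:

```python
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("generic-ingest").getOrCreate()

def ingest_csv(source_path: str, target_path: str, required_cols) -> DataFrame:
    """Read one CSV source, drop rows missing required columns, persist as Parquet."""
    df = spark.read.csv(source_path, header=True, inferSchema=True)
    df = df.dropna(subset=required_cols)
    df.write.mode("overwrite").parquet(target_path)
    return df

# Onboarding another dataset becomes one function call,
# not another hand-built pipeline.
ingest_csv("/raw/customers.csv", "/curated/customers", ["customer_id"])
ingest_csv("/raw/orders.csv", "/curated/orders", ["order_id"])
```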


A marriage with multiple (data) partners

On Azure, one does not necessarily have to go with a monogamous data solution. Some tools are perfectly suited to being used alongside one another, and as luck would have it, this is precisely the case with ADF and Azure Databricks. So you decide to deploy Azure Databricks alongside ADF. Your decision rule is simple: existing pipelines containing only a small number of transformations stay within ADF, while more complex pipelines are rewritten in Apache Spark code and run on Azure Databricks. Using the Databricks activities in the ADF GUI, you can easily schedule Azure Databricks notebooks from within ADF Pipelines, creating a hybrid solution.
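To sketch the Databricks side of such a hybrid setup: a notebook can read parameters that ADF’s Databricks Notebook activity passes in via its baseParameters setting, so one notebook can serve many pipelines. The parameter names and paths below are assumptions:

```python
# Databricks notebook: dbutils and spark are available out of the box.
# ADF's Databricks Notebook activity can override these defaults via baseParameters.
dbutils.widgets.text("source_path", "/raw/default.csv")   # assumed parameter name
dbutils.widgets.text("target_path", "/curated/default")   # assumed parameter name

source_path = dbutils.widgets.get("source_path")
target_path = dbutils.widgets.get("target_path")

df = spark.read.csv(source_path, header=True, inferSchema=True)
df.write.mode("overwrite").parquet(target_path)

# Return a status string to the calling ADF pipeline.
dbutils.notebook.exit(f"wrote {df.count()} rows to {target_path}")
```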


Conclusion: when it’s time to switch (data) partners

Looking back at this data love story, you conclude that ADF was great when you were still new to data and did not have the time to learn Apache Spark. Using the GUI, you set up ingestion and transformation Pipelines with minimal effort. Having over 90 data sources natively integrated with ADF made onboarding datasets straightforward. However, when you started onboarding more data, creating pipelines by hand became difficult to scale. This is when you started thinking about adding a new tool to your technology stack.

You are happy with your decision to go with Azure Databricks. Azure Databricks offered a more comprehensive platform and made creating generic and reusable solutions easier. In addition, Azure Databricks allowed you to achieve other goals like data analysis and visualization. Luckily, ADF made it easy for you to add Azure Databricks to your technology stack, as you were able to use Databricks Activities within the ADF GUI.


ADF or Azure Databricks: our advice

Generally speaking, Intercept advises using ADF when:

  • You have simple data ingestion and transformation needs.
  • You don’t have coding experience with Python, Scala, or Java, and you don’t have the time or the ambition to learn Apache Spark.

Intercept advises using Azure Databricks when:

  • You need a generic and scalable solution that can be applied to multiple data sources and use cases.
  • You have coding experience and want to learn Apache Spark.
  • You are creating complex data solutions.

This was the second article in our data series; also read our first article, about building Data Lakes. Do you have more questions about using Azure Data Factory or Azure Databricks? We would love to put you in touch with our Data Experts. Email us, and we will help you!