[For beginners] What is "metadata-driven ETL" created with Azure Data Factory? A thorough explanation of the next-generation architecture that dramatically improves the efficiency of data integration!
Hello! I'm John, a blog writer who has been following the trends of AI technology for many years. In modern business, data plays the leading role. Companies are overflowing with a wide variety of data, such as sales data, customer information, website access logs, and factory sensor information. However, most of this data is stored in different places, such as Excel files, in-house databases, and cloud services like Salesforce. The task of collecting this scattered data and arranging it into a form that is easy to analyze is a headache for many companies.
To solve this problem, a technology called "ETL" has long been used, but with the conventional method, experts have to write programs from scratch every time a new data source is added, which takes time and money. It's like building a whole new kitchen every time you want to make a new dish. But what if there were a magical kitchen where you could make any dish just by rewriting the "recipe"? The "metadata-driven ETL framework using Azure Data Factory" is the technology that makes this dream come true. It may sound a little difficult, but don't worry. By the time you finish reading this article, you will surely understand its powerful concept and benefits!
What is "metadata-driven ETL"?
To understand this technology, let’s first break down some keywords.
What is ETL? The process of "cooking" data
ETL is a term that describes the basic flow of data processing and is an acronym for the following three steps:
- Extract: Extract the original data from various locations (collect ingredients).
- Transform: The extracted data is processed and formatted according to easy-to-use formats and rules (washing, cutting, and seasoning ingredients).
- Load: Put the transformed data into the final storage destination (such as a data warehouse) (on a plate).
In other words, ETL is a series of "cooking" processes: collecting scattered data, arranging it neatly, and storing it in a table ready for analysis. The role of an ETL tool is to automate this process.
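To make the three steps concrete, here is a minimal, hard-coded ETL sketch in Python. The table name, field names, and sample data are all hypothetical, and an in-memory SQLite database stands in for a real data warehouse:

```python
# A minimal hard-coded ETL sketch (illustration only; all names are hypothetical).
import sqlite3

def extract(rows):
    """Extract: collect the raw records (an in-memory list stands in for a source system)."""
    return rows

def transform(rows):
    """Transform: normalize customer names and convert amounts to integers."""
    return [(name.strip().title(), int(amount)) for name, amount in rows]

def load(rows, conn):
    """Load: store the cleaned rows in the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
raw = [("  alice ", "1200"), ("BOB", "800")]
load(transform(extract(raw)), conn)
print(conn.execute("SELECT customer, amount FROM sales").fetchall())
# → [('Alice', 1200), ('Bob', 800)]
```

Note that the source, transformation rules, and destination are all fixed in code here; this is exactly the "hard coding" that the next section criticizes.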
The problem of traditional ETL being "rigid"
Traditionally, most ETL processes have been developed using a method known as "hard coding," whereby developers write instructions for where to get data, the transformation rules, and where to store it directly into the program code.
The problem with this method is, in a word, its inflexibility. Every time a request came in, such as "we want to add new social media data" or "we want to slightly change the calculation rules for sales data," developers had to laboriously modify, test, and redeploy the program code. If there were 10 data sources, 10 dedicated pipelines (each defining its own processing flow) were required; with 100 data sources, 100 pipelines. Just maintaining them demanded enormous effort and cost.
"Metadata-driven" approach as a solution
This is where the "metadata-driven" approach comes in as a groundbreaking solution.
Simply put, "metadata" is "data about data." In this context, it is the equivalent of the "instructions" or "blueprints" for ETL processing.
In a metadata-driven approach, the specific steps of ETL processing (such as "from which server" and "from which table" to extract data, "by what rules" to transform it, and "where" to store it) are not written directly into the program. All of it is stored as "metadata" in an external database.
Then, ETL tools like Azure Data Factory (ADF) read this "metadata (instructions)" when performing processing and dynamically process data according to its contents.
What does this mean?
If you want to add a new data source, you no longer need to touch the program code. Just add a new row to the metadata, and ADF automatically recognizes it and executes the new data integration process. This is a powerful paradigm shift that fundamentally changes the way development works.
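The idea above can be sketched in a few lines of Python. The metadata rows and the `run_copy` stand-in for a generic copy activity are hypothetical; the point is that adding a source means adding a row, not writing code:

```python
# Sketch of the metadata-driven idea (all names hypothetical): processing is
# driven entirely by rows of metadata, not by per-source code.
metadata = [
    {"source": "sales_db.orders", "target": "dwh.orders",    "enabled": True},
    {"source": "crm.customers",   "target": "dwh.customers", "enabled": True},
]

def run_copy(source, target):
    # Stand-in for a single generic copy activity (e.g. an ADF Copy pipeline).
    return f"copied {source} -> {target}"

def run_all(metadata):
    """Run the generic copy once per enabled metadata row."""
    return [run_copy(m["source"], m["target"]) for m in metadata if m["enabled"]]

print(run_all(metadata))

# Adding a brand-new source requires no code change -- just one more metadata row:
metadata.append({"source": "web.clickstream", "target": "dwh.clicks", "enabled": True})
print(run_all(metadata))
```

The copy logic exists exactly once; everything source-specific lives in data.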
The Heart of a Metadata-Driven Framework: Its Architecture
So what exactly are the components that make up this clever system? Here we will introduce a typical architecture that uses Azure services.
Control Tower: The "brain" that stores metadata (Azure SQL Database)
This metadata database is where all the "instructions" are stored. Azure SQL Database, a reliable and easy-to-manage relational database, is typically chosen for this role. The following information is organized into tables here:
- Connection Information Table: Information such as which database or file server to connect to. It does not directly store sensitive information such as passwords, but only stores reference information to the Key Vault described later.
- Processing definition table: Define what type of data transfer you want to perform, such as "Copy from database to database" or "Copy from file to cloud storage".
- Pipeline Configuration Table: It defines a specific execution plan for what data to process from what data source, in what order, and with what transformations. It is the "recipe" of ETL processing.
This database acts as the "brain" of the entire framework.
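As a rough illustration of what these three tables might look like, here is a sketch using SQLite as a local stand-in for Azure SQL Database. All table and column names are assumptions for illustration, not a prescribed schema:

```python
# Sketch of the three metadata tables, with SQLite standing in for
# Azure SQL Database (table and column names are illustrative assumptions).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE connection_info (
    connection_id   INTEGER PRIMARY KEY,
    server_name     TEXT,
    database_name   TEXT,
    key_vault_ref   TEXT   -- reference to a secret in Key Vault, never the password itself
);
CREATE TABLE processing_definition (
    process_type_id INTEGER PRIMARY KEY,
    description     TEXT   -- e.g. 'db-to-db copy', 'file-to-blob copy'
);
CREATE TABLE pipeline_config (
    config_id       INTEGER PRIMARY KEY,
    connection_id   INTEGER REFERENCES connection_info(connection_id),
    process_type_id INTEGER REFERENCES processing_definition(process_type_id),
    source_object   TEXT,
    target_object   TEXT,
    run_order       INTEGER
);
""")
conn.execute("INSERT INTO connection_info VALUES (1, 'server-a', 'sales', 'kv:sales-db-password')")
conn.execute("INSERT INTO processing_definition VALUES (1, 'db-to-db copy')")
conn.execute("INSERT INTO pipeline_config VALUES (1, 1, 1, 'orders', 'dwh.orders', 1)")
print(conn.execute("SELECT source_object, target_object FROM pipeline_config").fetchall())
```

Notice that `connection_info` carries only a Key Vault reference, never a password, matching the security design described later.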
The Execution Director: Parent Pipeline in ADF
First, create a single pipeline called the "parent pipeline" in Azure Data Factory to direct the entire process. This parent pipeline is very simple, and when it receives a specific process ID, it simply queries the metadata database, asking "What is the job description for this ID?" The database returns a list of processes to be executed in JSON format (a data format that is easy for computers to read).
Production Forces: Reusable Child Pipelines
The "child pipeline" actually executes the "job list" received by the parent pipeline. Child pipelines are created as templates for specific tasks, such as "copying data between databases" or "copying files."
The parent pipeline passes specific information obtained from the metadata (such as "Customer table on Server A" to "Customer master on Server B") as parameters to the child pipeline to instruct it to execute. This modular structure allows you to simply add a new child pipeline template when a new type of processing is required, keeping the overall system simple.
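The parent/child split described above can be sketched as plain Python functions. The process ID, job types, and object names are all hypothetical, and a dict stands in for both the metadata database (which in ADF would return the job list as JSON via a Lookup) and the pipeline registry:

```python
# Sketch of the parent/child split (hypothetical names): the parent looks up
# the job list for a process ID, then hands each job, as parameters, to the
# matching reusable child pipeline template.
import json

# Stand-in for the metadata database: process ID -> job list (JSON, as in ADF).
METADATA = {
    "daily_load": json.dumps([
        {"type": "db_copy",   "source": "serverA.customers",  "target": "serverB.customer_master"},
        {"type": "file_copy", "source": "/exports/sales.csv", "target": "blob://raw/sales.csv"},
    ])
}

# Child pipeline templates: one generic, reusable implementation per task type.
def child_db_copy(source, target):
    return f"db copy: {source} -> {target}"

def child_file_copy(source, target):
    return f"file copy: {source} -> {target}"

CHILDREN = {"db_copy": child_db_copy, "file_copy": child_file_copy}

def parent_pipeline(process_id):
    """Ask the metadata 'what is the job list for this ID?' and dispatch each job."""
    jobs = json.loads(METADATA[process_id])
    return [CHILDREN[j["type"]](j["source"], j["target"]) for j in jobs]

print(parent_pipeline("daily_load"))
```

Supporting a new kind of processing means registering one more child template in `CHILDREN`; the parent never changes.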
Key to security: Azure Key Vault
Security is one of the most important factors in data integration. Storing confidential information such as database passwords and API keys in plain text in the metadata would be very dangerous. That is why a "secret vault" service called Azure Key Vault is used.
All sensitive information is stored securely in Key Vault, and metadata only records reference information indicating the location of the key. ADF pipelines access Key Vault only at runtime to obtain the necessary information, minimizing the risk of sensitive data leaking.
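The reference pattern looks roughly like this. A dict stands in for Key Vault here (in a real pipeline the secret would be fetched via Key Vault, e.g. with the `azure-keyvault-secrets` client); the `kv:` prefix convention and secret name are assumptions for illustration:

```python
# Sketch of the secret-reference pattern (names hypothetical): metadata stores
# only a pointer like 'kv:sales-db-password'; the real secret is fetched at
# runtime. A dict stands in for Azure Key Vault.
FAKE_KEY_VAULT = {"sales-db-password": "s3cr3t!"}

def resolve_secret(reference):
    """Resolve a 'kv:<secret-name>' reference at runtime; plain values pass through."""
    if reference.startswith("kv:"):
        return FAKE_KEY_VAULT[reference[len("kv:"):]]
    return reference

connection_metadata = {"server": "server-a", "password_ref": "kv:sales-db-password"}
password = resolve_secret(connection_metadata["password_ref"])
print(password)  # fetched only at runtime; never stored in the metadata itself
```

Even if the metadata database leaks, an attacker obtains only pointers, not credentials.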
Who uses this technology and why?
The framework will be of particular benefit to the following individuals and organizations:
- Data engineers and IT architects: Dramatically reduce development and maintenance man-hours. With less manual coding and modification work, you can focus on designing a more strategic data utilization platform.
- Business users: You can quickly respond to sudden requests from the business side, such as "I want to add that data to the analysis." This dramatically improves the speed of data-driven decision-making.
- Enterprise-wide: In addition to reducing development costs, standardizing and automating data integration processes that tend to be dependent on individual skills will enable the creation of a stable and scalable data infrastructure. This will be a powerful driving force behind companies' digital transformation (DX).
In addition, because it is built on the vast ecosystem that is Microsoft Azure, it integrates smoothly with other Azure services (machine learning, BI tools, etc.) and is highly expandable in the future.
Use cases and future potential
Specific usage scenarios
This framework proves its worth in a variety of data integration scenarios.
- Building a Data Warehouse (DWH)/Data Lake: Efficiently consolidate all data scattered throughout the company into a DWH or data lake that serves as the hub for analysis.
- On-premise to cloud data migration: Safely and systematically migrate large amounts of data from older on-premise systems to modern cloud platforms.
- Data synchronization between applications: For example, it can be used to automatically synchronize customer information and billing information between Salesforce (customer management) and an accounting system.
The future of this approach
Metadata-driven approaches are likely to become increasingly important in the future.
The biggest reason is scalability. As your business grows and you start dealing with dozens or even hundreds of data sources, this framework will not break down: just add metadata and the system scales automatically.
More recently, integration with "Microsoft Fabric," the unified analytics platform announced by Microsoft, has also been progressing. Data integration pipelines built with this framework will function seamlessly within the Fabric ecosystem, further accelerating the entire process from data preparation to analysis and visualization.
Comparison with conventional methods
Let's compare the advantages of a metadata-driven approach versus a traditional hard-coded approach.
| Evaluation item | Metadata-driven | Traditional hard coding |
|---|---|---|
| Flexibility and ease of modification | Very high (changes need only metadata edits) | Low (code modification and testing required) |
| Speed of adding new sources | Fast (configuration only) | Slow (requires separate development) |
| Maintainability | Easy (logic is centralized) | Difficult (each pipeline maintained separately) |
| Reusability | Very high (templated pipelines are reused) | Low (built individually) |
| Initial build complexity | Somewhat high (requires framework design) | Low (just create individual pipelines) |
As you can see, it takes a bit of effort to set up initially, but the benefits in the long run are immeasurable.
Precautions and issues to be aware of when implementing
Of course, this approach is not a silver bullet, and there are some challenges and caveats to be aware of when adopting it.
- The Importance of Early Design: The design of the metadata database that is the foundation of the framework is very important. It is necessary to plan in advance what information will be managed and how, with an eye on future expansion.
- Performance Considerations: When dealing with large amounts of data, such as when processing hundreds of tables in parallel, performance can become an issue. You may need to use techniques to optimize performance, such as parallel processing or partitioning (dividing and processing data).
- Addressing tool constraints: Azure Data Factory also has some limitations, such as an upper limit on the number of processes that can be branched at one time. You may need to understand these limitations and devise ways to work around them in your architecture (for example, by categorizing processes and executing them in multiple stages).
These challenges can be overcome by experienced architects and engineers, but keep in mind that the initial implementation will require some learning and trial and error.
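The "execute in multiple stages" workaround mentioned above can be sketched simply: when a tool caps how many items a single loop can handle, split the metadata-driven job list into chunks and run the chunks sequentially. The limit of 3 and the job names are hypothetical:

```python
# Sketch of the staging workaround (hypothetical limit and job names): split a
# job list into chunks no larger than the tool's per-loop limit, then run the
# chunks one stage at a time.
def split_into_stages(jobs, max_per_stage):
    """Split a job list into stages of at most max_per_stage items each."""
    return [jobs[i:i + max_per_stage] for i in range(0, len(jobs), max_per_stage)]

jobs = [f"table_{n}" for n in range(1, 8)]   # 7 jobs, hypothetical limit of 3 per stage
for i, stage in enumerate(split_into_stages(jobs, 3), start=1):
    print(f"stage {i}: {stage}")
```

Because the stages themselves can be derived from metadata (for example, a category column), this workaround fits naturally into the framework rather than fighting it.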
Summary: The key to opening up the future of data utilization
In this article, we delved into a slightly more technical topic: the "metadata-driven ETL framework" built with Azure Data Factory.
The core of this approach is separating the logic of "how" ETL processing is performed from the definition (metadata) of "what" to do. This frees us from the endless task of maintaining rigid data pipelines and gives us a more flexible, scalable, and efficient data infrastructure.
Although it takes time to design and learn as an initial investment, once this system is built, the speed and quality of subsequent data integration projects will improve dramatically. This is a very worthwhile investment for any company aiming to become a data-driven organization.
We hope this article will give you some tips to help you take your company's data utilization to the next level.
*The technologies and architectures introduced in this article are general examples. When applying them to an actual project, please take into consideration your own requirements and environment to create the optimal design.
Frequently Asked Questions (FAQ)
- Q1: Is it possible to build this system only by specialized developers?
- A1: The initial construction of the framework itself requires the efforts of data engineers and architects with knowledge of Azure Data Factory and databases. However, once the framework is complete, a major advantage is that daily operation (such as adding new data sources) can be performed by non-experts simply by managing metadata.
- Q2: What is the biggest difference between Azure Data Factory and traditional ETL tools such as SSIS?
- A2: While SSIS (SQL Server Integration Services) mainly runs on servers in on-premise environments, Azure Data Factory is a cloud-native (designed from the beginning to be used in the cloud) serverless service. The big difference is that you can easily build data integration in complex hybrid environments that span on-premise and cloud without having to worry about managing the infrastructure.
- Q3: Will implementing such a framework increase costs?
- A3: Azure Data Factory is a pay-as-you-go system where you are charged only for the amount of processing performed. Although there is an initial construction cost, it can significantly reduce subsequent development and maintenance efforts, so in the long run, the total cost of ownership (TCO) is often lower. In particular, the more data sources you manage, the greater the cost reduction effect.
Links
- Azure Data Factory official website
- Microsoft Learn: Azure Data Factory Documentation
- Reference article: Designing a metadata-driven ETL framework with Azure ADF