Data Factory
Definition
Azure Data Factory is a cloud-based data integration service for creating, scheduling, and managing data-driven workflows (pipelines) that move and transform data between systems.
Use Cases
- Rolls-Royce: Ingesting and integrating large volumes of operational and IoT data to support analytics and monitoring. Used Azure Data Factory to orchestrate data movement from multiple sources into Azure data platforms for downstream processing and analytics. Outcome: more reliable data ingestion and faster access to integrated data for analytics.
- Heathrow Airport: Integrating data from operational systems to support reporting and analytics for airport operations. Used Azure Data Factory pipelines to move and transform data into centralized analytics stores on Azure, coordinating scheduled loads and their dependencies. Outcome: more consistent data refreshes and better visibility for operational reporting and analytics.
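The "scheduled loads and dependencies" in these use cases can be pictured with a small sketch. The dictionary below follows the general shape of a Data Factory pipeline definition, where each activity lists the activities it depends on under dependsOn. The pipeline and activity names are hypothetical, and the ordering helper is only an illustration of what the dependencies imply, not a description of how the service schedules work internally.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline, shaped like a Data Factory pipeline definition.
# All names are made up for illustration.
pipeline = {
    "name": "LoadOperationalData",
    "properties": {
        "activities": [
            {"name": "CopyFromSource", "type": "Copy", "dependsOn": []},
            {
                "name": "TransformData",
                "type": "ExecuteDataFlow",
                "dependsOn": [
                    {"activity": "CopyFromSource",
                     "dependencyConditions": ["Succeeded"]}
                ],
            },
            {
                "name": "RunNotebook",
                "type": "DatabricksNotebook",
                "dependsOn": [
                    {"activity": "TransformData",
                     "dependencyConditions": ["Succeeded"]}
                ],
            },
        ]
    },
}


def execution_order(pipeline_def):
    """Return one valid run order implied by the dependsOn entries."""
    # Map each activity name to the set of activities it waits for.
    graph = {
        activity["name"]: {dep["activity"] for dep in activity.get("dependsOn", [])}
        for activity in pipeline_def["properties"]["activities"]
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(pipeline))
```

Because each activity here waits on the previous one succeeding, only one run order is possible. Activities with no dependency between them could run in parallel.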
Provider Equivalents
- AWS: AWS Glue
- Azure: Azure Data Factory
- GCP: Cloud Data Fusion
- OCI: OCI Data Integration
Frequently Asked Questions
- What's the difference between Azure Data Factory and Azure Synapse Pipelines?
  - They share a very similar pipeline authoring experience. Azure Data Factory is a standalone data integration service focused on orchestrating data movement and transformation across many systems. Synapse Pipelines provides similar orchestration capabilities but is integrated into Azure Synapse Analytics, making it convenient when your primary analytics workspace is Synapse.
- When should I use Data Factory?
  - Use Azure Data Factory when you need to regularly move data between systems (for example, on a schedule or triggered by events), orchestrate multi-step workflows with dependencies, and connect to many data sources. Common scenarios include loading data into a data warehouse/lake, copying data between on-premises and cloud, and coordinating transformations using mapping data flows or external compute like Databricks.
- How much does Data Factory cost?
  - Pricing is usage-based. Key cost drivers typically include pipeline orchestration/activity runs, data movement (copy activity and integration runtime usage), and transformation compute (for example, Mapping Data Flows use managed Spark clusters billed by time and capacity). Costs vary by region, number of runs, data volume, and whether you use self-hosted vs managed integration runtimes.
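As a rough illustration of how the cost drivers in the pricing answer add up, here is a minimal sketch. It assumes three billing meters (activity runs, data-movement DIU-hours, and data-flow vCore-hours). The Rates class, the function name, and every rate value are hypothetical placeholders rather than actual Azure prices, so the current pricing page for your region remains the source for real figures.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Placeholder unit prices; NOT real Azure rates."""
    per_1000_activity_runs: float  # orchestration
    per_diu_hour: float            # data movement (copy activity)
    per_vcore_hour: float          # mapping data flow compute


def estimate_monthly_cost(rates, activity_runs, diu_hours, vcore_hours):
    """Sum the three usage-based components into a monthly estimate."""
    orchestration = activity_runs / 1000 * rates.per_1000_activity_runs
    movement = diu_hours * rates.per_diu_hour
    transformation = vcore_hours * rates.per_vcore_hour
    return {
        "orchestration": round(orchestration, 2),
        "movement": round(movement, 2),
        "transformation": round(transformation, 2),
        "total": round(orchestration + movement + transformation, 2),
    }


if __name__ == "__main__":
    # Made-up rates and usage, purely to show the arithmetic.
    example_rates = Rates(per_1000_activity_runs=1.00,
                          per_diu_hour=0.25,
                          per_vcore_hour=0.30)
    print(estimate_monthly_cost(example_rates,
                                activity_runs=3000,
                                diu_hours=40,
                                vcore_hours=16))
```

Keeping the components separate makes it easier to see which driver dominates a given workload, for example many small activity runs versus a few long-running data flows.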
Category: data
Difficulty: intermediate