Glue
Definition
AWS Glue is a fully managed, serverless ETL (extract, transform, load) service from AWS for preparing data for analytics. Think of it as a data processing factory that automatically cleans and organizes raw data.
Use Cases
- Zynga: Prepare and analyze large-scale game event data for analytics and reporting. Zynga has described using AWS analytics services; in such architectures, AWS Glue is commonly used to crawl data in Amazon S3, maintain a central schema in the Glue Data Catalog, and run ETL jobs that transform raw logs into analytics-ready datasets queried by services like Amazon Athena. Benefit: faster availability of curated datasets for analysts and more scalable processing of high-volume event data without managing ETL servers.
- FINRA: Process and analyze large volumes of market data to support surveillance and compliance analytics. FINRA has publicly discussed using AWS for big data analytics; in such architectures, AWS Glue is commonly used to catalog data in S3 and run ETL transformations that feed downstream analytics (for example, query engines and data lakes). Benefit: improved ability to organize and prepare large datasets for analytics workflows with managed, scalable ETL components.
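The crawl, transform, and query flow described in these use cases can be sketched with the AWS SDK for Python (boto3). This is a minimal illustration under stated assumptions, not either company's actual pipeline: the function names, crawler name, job name, database, and S3 output location are hypothetical, and the clients are passed in as arguments (for example `boto3.client("glue")` and `boto3.client("athena")`) so the sketch does not assume any particular credentials setup.

```python
import time

# Job run states after which polling should stop.
TERMINAL_JOB_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def crawl_then_transform(glue, crawler_name, job_name, poll_seconds=30, sleep=time.sleep):
    """Run a Glue crawler, wait for it, then run a Glue ETL job and wait for it.

    `glue` is expected to be a boto3 Glue client. Returns the final job run state.
    """
    # Step 1: crawl the S3 data so the Data Catalog tables are up to date.
    glue.start_crawler(Name=crawler_name)
    while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
        sleep(poll_seconds)

    # Step 2: start the ETL job that turns raw data into curated datasets.
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        job_run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        if job_run["JobRunState"] in TERMINAL_JOB_STATES:
            return job_run["JobRunState"]
        sleep(poll_seconds)


def query_curated_table(athena, sql, database, output_location):
    """Start an Athena query against a table registered in the Glue Data Catalog.

    `athena` is expected to be a boto3 Athena client. Returns the query execution ID.
    """
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

A production pipeline would more likely use Glue triggers or workflows than a polling loop; the loop is used here only to make the ordering of the steps explicit.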
Provider Equivalents
- AWS: AWS Glue
- Azure: Azure Data Factory
- GCP: Cloud Data Fusion
- OCI: OCI Data Integration
Frequently Asked Questions
- What's the difference between AWS Glue and Amazon Athena?
- AWS Glue prepares and organizes data (ETL) and stores table definitions in the Glue Data Catalog. Amazon Athena is a query service that runs SQL directly on data in Amazon S3. In practice, Glue often creates/maintains the tables and partitions, and Athena queries them.
- When should I use AWS Glue?
- Use AWS Glue when you need to discover data (crawlers), maintain a central catalog of tables, and run managed ETL to clean, join, and transform data for analytics or machine learning. It’s a good fit for data lakes on S3, recurring batch pipelines, and situations where you don’t want to manage Spark clusters.
- How much does AWS Glue cost?
- Pricing is usage-based. Common cost drivers include: (1) ETL job run time multiplied by the compute allocated (billed in DPU-hours for many Glue job types), (2) crawler run time, which is also billed in DPU-hours, so frequent or long crawls add up, (3) Data Catalog object storage (tables/partitions) and requests, and (4) any additional features you use (for example, development endpoints in older workflows). Exact costs depend on how long jobs run, how much compute they are allocated, how often crawlers scan, and how much data is processed.
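The first cost driver is simple arithmetic: DPU-hours are the number of DPUs multiplied by the billed run time in hours, and the charge is DPU-hours times the per-DPU-hour rate. A small sketch follows. The function name is hypothetical, and the rate and minimum billed duration are caller-supplied assumptions rather than quoted AWS prices, so check the current pricing page for your region and Glue version.

```python
def estimate_job_cost(dpus, runtime_seconds, rate_per_dpu_hour, minimum_billed_seconds=0):
    """Estimate the charge for one Glue job run from DPUs, run time, and a rate.

    All inputs are assumptions supplied by the caller; nothing here looks up
    real AWS pricing.
    """
    if dpus < 0 or runtime_seconds < 0 or rate_per_dpu_hour < 0:
        raise ValueError("dpus, runtime_seconds, and rate_per_dpu_hour must be non-negative")

    # Short runs are billed for at least the minimum duration, if one applies.
    billed_seconds = max(runtime_seconds, minimum_billed_seconds)

    # DPU-hours = DPUs x hours of billed run time.
    dpu_hours = dpus * billed_seconds / 3600
    return dpu_hours * rate_per_dpu_hour
```

For example, `estimate_job_cost(10, 900, 0.44)` models a 15-minute run on 10 DPUs at an illustrative rate of 0.44 per DPU-hour: 2.5 DPU-hours, or about 1.10.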
Category: data
Difficulty: advanced