EMR

Definition

Amazon EMR (originally Amazon Elastic MapReduce): a managed AWS big data platform that runs open source frameworks such as Apache Spark and Hadoop for scalable data processing and analysis.

Use Cases

Provider Equivalents

Frequently Asked Questions

What's the difference between Amazon EMR and AWS Glue?
Amazon EMR is a managed cluster platform where you run big data engines like Spark, Hadoop, Hive, and Presto/Trino with a lot of control over the environment. AWS Glue is a managed, serverless data integration service focused on ETL (extract, transform, load) and a data catalog. Use EMR when you need full-featured Spark/Hadoop clusters, custom configurations, or long-running/complex jobs; use Glue when you want simpler, serverless ETL with less cluster management.
When should I use EMR?
Use EMR when you need to process large datasets with Spark/Hadoop ecosystem tools, especially for batch ETL, log processing, large-scale joins/aggregations, machine learning feature generation, or running SQL engines like Hive/Presto/Trino. EMR is a good fit when you want elastic scaling, integration with S3 and IAM, and the flexibility to tune cluster size, instance types, and software settings.
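To make the tuning knobs mentioned above concrete (cluster size, instance types, software), here is a minimal sketch that builds the request for boto3's EMR `run_job_flow` call. The helper name `build_job_flow_request`, the cluster name, the release label, the instance type, and the use of the default EMR roles are all illustrative assumptions; check them against your own account before running anything.

```python
def build_job_flow_request(name, release_label, core_count, task_count,
                           instance_type="m5.xlarge"):
    """Build keyword arguments for boto3's EMR client run_job_flow().

    Hypothetical helper: every value here is an example, not a recommendation.
    """
    groups = [
        # One primary node (the API still calls this role MASTER).
        {"Name": "primary", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes run tasks and host HDFS, so keep them On-Demand here.
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes hold no HDFS data, so they are the usual place for Spot.
        groups.append(
            {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": instance_type, "InstanceCount": task_count})

    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": groups,
            # False makes the cluster ephemeral: it terminates once its
            # steps finish instead of staying always-on.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # Assumed: the default EMR roles already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_job_flow_request("nightly-etl", "emr-6.15.0",
                                 core_count=2, task_count=4)
```

With AWS credentials configured, the request could then be submitted with `boto3.client("emr").run_job_flow(**request)`; that call is left out of the sketch so the block runs without network access.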
How much does EMR cost?
EMR pricing has three main components: (1) the underlying compute instances (EC2) you run, (2) an additional EMR service charge per instance-hour (varies by instance type and Region), and (3) storage and data transfer (e.g., S3, EBS, inter-AZ traffic). Costs depend on cluster size, how long clusters run, whether you use Spot Instances, and whether you keep clusters always-on versus creating ephemeral clusters per job. You can reduce cost by using Spot Instances for task nodes, right-sizing instance types, enabling automatic scaling, and storing data in S3 instead of HDFS when appropriate.
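The compute part of that breakdown can be sketched as simple arithmetic: for each group of nodes, add the EC2 price and the EMR charge, then multiply by node count and runtime. The function name `estimate_compute_cost` and every price below are made-up placeholders, not real AWS rates; storage and data transfer are deliberately left out.

```python
def estimate_compute_cost(groups, hours):
    """Estimate EC2 + EMR compute cost for one cluster run.

    Hypothetical helper. Each group is a dict with:
      count     - number of instances in the group
      ec2_price - hourly EC2 price per instance (On-Demand or Spot)
      emr_price - hourly EMR service charge per instance
    Storage (S3, EBS) and data transfer are not included.
    """
    total = 0.0
    for group in groups:
        hourly_per_node = group["ec2_price"] + group["emr_price"]
        total += group["count"] * hourly_per_node * hours
    return total


# Placeholder prices for illustration only; look up current rates
# for your instance type and Region.
on_demand_core = {"count": 2, "ec2_price": 0.20, "emr_price": 0.05}
spot_task = {"count": 4, "ec2_price": 0.07, "emr_price": 0.05}
on_demand_task = {"count": 4, "ec2_price": 0.20, "emr_price": 0.05}

with_spot = estimate_compute_cost([on_demand_core, spot_task], hours=3)
without_spot = estimate_compute_cost([on_demand_core, on_demand_task], hours=3)
print(f"3-hour run with Spot task nodes:    ${with_spot:.2f}")
print(f"3-hour run with On-Demand task nodes: ${without_spot:.2f}")
```

Note that in this model the EMR charge applies to Spot nodes as well; only the EC2 portion changes. The same function can compare an always-on cluster with an ephemeral one by varying `hours`.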

Category: data

Difficulty: advanced

Related Terms

See Also