EMR
Definition
Amazon EMR (originally Amazon Elastic MapReduce) - AWS managed big data platform that runs open-source frameworks such as Apache Spark and Apache Hadoop on resizable clusters, for scalable data processing and analysis.
Use Cases
- Pinterest: Large-scale ETL and analytics to process user engagement and advertising data. — Used Amazon EMR with Apache Spark/Hadoop to run batch processing pipelines on data stored in Amazon S3, scaling clusters up for heavy jobs and down when finished. (Improved ability to process growing datasets with elastic scaling and reduced operational overhead compared to self-managed Hadoop clusters.)
- Netflix: Big data processing for analytics and data pipelines supporting streaming operations and business reporting. — Runs large-scale batch processing on AWS using Amazon EMR with Hadoop/Spark ecosystem tools, commonly integrating with Amazon S3 as the data lake and using ephemeral clusters for scheduled workloads. (Faster iteration on data pipelines and the ability to scale compute for peak processing windows without maintaining fixed on-prem clusters.)
- Airbnb: Data warehousing and analytics workflows, including ETL jobs and experimentation analytics. — Adopted AWS big data tooling including Amazon EMR for Spark/Hadoop-based processing, typically reading/writing data in S3 and orchestrating recurring jobs. (Supported rapid growth in analytics needs by scaling compute on demand and standardizing on widely used open-source processing frameworks.)
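The use cases above share one pattern: an ephemeral cluster that reads from S3, runs a batch job, and shuts down when the job finishes. A minimal sketch of that pattern follows, assuming the boto3 EMR client's run_job_flow call and the default EMR roles. The function name, bucket paths, instance types, node counts, and release label are hypothetical placeholders, not details taken from any of the companies listed.

```python
def build_ephemeral_cluster_request(job_name, script_s3_uri, log_s3_uri, task_nodes=4):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    All sizes, instance types, and the release label are illustrative
    assumptions; adjust them for a real workload.
    """
    return {
        "Name": job_name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release; pick one available in your Region
        "Applications": [{"Name": "Spark"}],
        "LogUri": log_s3_uri,
        "Instances": {
            "InstanceGroups": [
                {"Name": "primary", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
                # Task nodes hold no HDFS data, so losing a Spot node is less costly.
                {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
                 "InstanceType": "m5.xlarge", "InstanceCount": task_nodes},
            ],
            # False makes the cluster terminate once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [
            {
                "Name": "spark-batch-job",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default EC2 instance profile name
        "ServiceRole": "EMR_DefaultRole",      # default EMR service role name
    }


request = build_ephemeral_cluster_request(
    job_name="nightly-etl",
    script_s3_uri="s3://example-bucket/jobs/etl.py",  # hypothetical path
    log_s3_uri="s3://example-bucket/emr-logs/",       # hypothetical path
)

# To actually launch the cluster (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

The request is built as a plain dictionary so it can be inspected or reviewed before anything is launched. Only the commented-out call at the end would create billable resources.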
Provider Equivalents
- AWS: Amazon EMR
- Azure: Azure HDInsight
- GCP: Google Cloud Dataproc
- OCI: OCI Big Data Service (managed Hadoop/Spark clusters); OCI Data Flow for serverless Spark
Frequently Asked Questions
- What's the difference between Amazon EMR and AWS Glue?
- Amazon EMR is a managed cluster platform where you run big data engines like Spark, Hadoop, Hive, and Presto/Trino with a lot of control over the environment. AWS Glue is a managed, serverless data integration service focused on ETL (extract, transform, load) and a data catalog. Use EMR when you need full-featured Spark/Hadoop clusters, custom configurations, or long-running/complex jobs; use Glue when you want simpler, serverless ETL with less cluster management.
- When should I use EMR?
- Use EMR when you need to process large datasets with Spark/Hadoop ecosystem tools, especially for batch ETL, log processing, large-scale joins/aggregations, machine learning feature generation, or running SQL engines like Hive/Presto/Trino. EMR is a good fit when you want elastic scaling, integration with S3 and IAM, and the flexibility to tune cluster size, instance types, and software settings.
- How much does EMR cost?
- EMR on EC2 pricing has three main parts: (1) the underlying EC2 instances you run, (2) an additional EMR service charge per instance, billed on top of the EC2 price (it varies by instance type and Region), and (3) storage and data transfer (e.g., S3, EBS, inter-AZ traffic). Total cost depends on cluster size, how long clusters run, whether you use Spot Instances, and whether you keep clusters always-on or create ephemeral clusters per job. You can reduce cost by using Spot for task nodes, right-sizing instance types, using automatic scaling, and storing data in S3 instead of HDFS when appropriate.
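The cost answer above reduces to simple arithmetic: for each node group, multiply the node count by the sum of the EC2 price and the EMR charge, then multiply the cluster total by the hours it runs. The sketch below compares an always-on cluster with an ephemeral one that uses Spot for task nodes. Every price in it is a made-up placeholder, not a real AWS rate, and the NodeGroup and cluster_cost names are hypothetical. Storage and data transfer are left out.

```python
from dataclasses import dataclass


@dataclass
class NodeGroup:
    name: str
    count: int
    ec2_hourly_usd: float  # EC2 price per instance-hour (placeholder value)
    emr_hourly_usd: float  # EMR charge per instance-hour (placeholder value)


def cluster_cost(groups, hours):
    """Compute cost in USD for running the given node groups for `hours` hours."""
    per_hour = sum(g.count * (g.ec2_hourly_usd + g.emr_hourly_usd) for g in groups)
    return round(per_hour * hours, 2)


# Always-on cluster: every node On-Demand, running 24 hours a day for 30 days.
always_on = [
    NodeGroup("primary", 1, 0.20, 0.05),
    NodeGroup("core", 2, 0.20, 0.05),
    NodeGroup("task", 4, 0.20, 0.05),
]

# Ephemeral cluster: same shape, but task nodes on Spot at an assumed lower
# EC2 price; the EMR charge per instance stays the same.
ephemeral = [
    NodeGroup("primary", 1, 0.20, 0.05),
    NodeGroup("core", 2, 0.20, 0.05),
    NodeGroup("task", 4, 0.06, 0.05),
]

always_on_monthly = cluster_cost(always_on, hours=24 * 30)
ephemeral_monthly = cluster_cost(ephemeral, hours=3 * 30)  # one 3-hour job per day

print(f"always-on: ${always_on_monthly:.2f}/month")
print(f"ephemeral: ${ephemeral_monthly:.2f}/month")
```

With these placeholder numbers most of the saving comes from the shorter running time rather than from the Spot discount, which is why ephemeral clusters are usually the first cost lever to consider for scheduled batch jobs.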
Category: data
Difficulty: advanced