MapReduce
Definition
A programming model for processing large datasets in parallel across many machines, optimizing data processing efficiency and speed.
Use Cases
- Google: Large-scale web indexing and data processing — Google developed the MapReduce programming model to process massive datasets across distributed clusters, including tasks such as building search indexes, processing crawled web pages, and analyzing logs. (Enabled Google engineers to process internet-scale datasets reliably by splitting work across thousands of machines and automatically handling failures.)
- Yahoo: Web search indexing and large-scale analytics — Yahoo was an early major user and contributor to Apache Hadoop, using Hadoop MapReduce to process large volumes of web and search-related data on large clusters. (Helped prove Hadoop MapReduce could support production-scale data processing and contributed to Hadoop becoming widely adopted in enterprise big data systems.)
- The New York Times: Digitizing and processing historical newspaper archives — The New York Times used Hadoop running on Amazon EC2 to process large numbers of scanned archive images and generate web-ready output. (Completed a large batch-processing job in the cloud without buying permanent infrastructure, demonstrating the practical value of elastic cloud-based data processing.)
Provider Equivalents
- AWS: Amazon EMR
- Azure: Azure HDInsight
- GCP: Google Cloud Dataproc
- OCI: Oracle Cloud Infrastructure Big Data Service
Frequently Asked Questions
- What's the difference between MapReduce and Apache Spark?
- MapReduce processes data in separate map and reduce stages and usually writes intermediate results to disk. Apache Spark can keep data in memory between steps, which often makes it faster for iterative analytics, machine learning, and interactive queries. MapReduce is reliable and good for large batch jobs, while Spark is usually preferred for newer big data workloads that need speed and flexibility.
- When should I use MapReduce?
- Use MapReduce when you need to process very large datasets in batch mode, the work can be split into independent pieces, and speed is less important than reliability and scale. Common examples include log analysis, large file transformations, indexing, counting, sorting, and aggregating records. For real-time analytics, low-latency queries, or iterative machine learning, tools like Spark, Flink, or cloud data warehouses may be better choices.
- How much does MapReduce cost?
- MapReduce itself is a programming model and does not have a direct price. Costs come from the infrastructure or managed service used to run it, such as compute nodes, storage, data transfer, cluster runtime, and management fees. In cloud services like Amazon EMR, Azure HDInsight, Google Cloud Dataproc, or OCI Big Data Service, you typically pay for virtual machines, disks or object storage, and the time the cluster is running. Costs can be reduced by using autoscaling, spot or preemptible instances, right-sized clusters, and shutting clusters down after jobs finish.
Category: data
Difficulty: advanced
Related Terms
See Also