Hadoop
Definition
Open-source Apache framework for storing and processing very large datasets across clusters of commodity computers, used as a foundation for big data analytics and data management.
Use Cases
- Yahoo: Large-scale web indexing, log processing, and data analytics. Yahoo was an early major contributor to Hadoop and ran large Hadoop clusters to store and process massive web and user activity datasets across commodity servers. Hadoop enabled Yahoo to process internet-scale data more economically and helped prove that distributed processing on clusters of commodity hardware could work at very large scale.
- Facebook: Data warehousing and analytics for user activity, product metrics, and advertising insights. Facebook used Hadoop with related tools such as Hive to store and query huge volumes of event and log data generated by its platform. Teams could analyze large datasets for product decisions, growth analysis, and ad measurement without relying only on traditional relational databases.
- eBay: Analyzing marketplace data, search behavior, customer activity, and transaction patterns. eBay used Hadoop-based data platforms to process large volumes of structured and semi-structured marketplace data for analytics and experimentation. The company improved its ability to run large-scale analysis, support recommendation and search improvements, and make data-driven marketplace decisions.
Provider Equivalents
- AWS: Amazon EMR
- Azure: Azure HDInsight
- GCP: Google Cloud Dataproc
- OCI: OCI Big Data Service
Frequently Asked Questions
- What's the difference between Hadoop and Spark?
- Hadoop is a broader ecosystem for distributed storage and batch processing, built around HDFS for storage, YARN for resource management, and MapReduce for computation. Spark is a distributed processing engine that keeps intermediate data in memory where it can, which typically makes it faster than MapReduce for multi-step and iterative jobs, and it often runs on Hadoop-compatible storage such as HDFS. In simple terms, Hadoop is commonly associated with storing huge datasets and processing them in batches, while Spark is often used when teams need faster analytics, iterative machine learning, or near-real-time processing.
- When should I use Hadoop?
- Use Hadoop when you need to store and process very large datasets that do not fit well on one machine, especially for batch analytics over logs, transactions, clickstreams, sensor data, or historical records. It is most useful when data volume is measured in terabytes or petabytes and parallel processing can reduce processing time. For smaller datasets, real-time applications, or simple reporting, a cloud data warehouse, relational database, or managed analytics service may be easier.
- How much does Hadoop cost?
- Hadoop itself is open source, so there is no software license cost for the core Apache Hadoop project. The main costs are compute servers or cloud instances, storage, networking, cluster management, monitoring, backups, security, and skilled administrators. Managed services such as Amazon EMR, Azure HDInsight, Google Cloud Dataproc, and OCI Big Data Service charge for the underlying compute and storage resources, and may also include service-specific management fees.
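
The batch processing model mentioned in the answers above can be illustrated with a small sketch. This is not Hadoop's API: it is a single-process Python imitation of the map, shuffle, and reduce steps that MapReduce runs across a cluster. The function names (map_phase, shuffle, reduce_phase, run_job) and the "path status" log format are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Emit (status_code, 1) for one log line in the assumed 'path status' format."""
    parts = line.split()
    if len(parts) >= 2:
        yield parts[1], 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group all values by key, as the framework does between map and reduce."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(splits: list[list[str]]) -> dict[str, int]:
    """Run map over every input split, then shuffle and reduce.

    On a real cluster each split would go to a separate mapper task;
    here the splits are processed one after another in a single process.
    """
    mapped: list[tuple[str, int]] = []
    for split in splits:
        for line in split:
            mapped.extend(map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Two input splits standing in for blocks of a large log file.
    splits = [
        ["/home 200", "/cart 500"],
        ["/home 200", "/missing 404"],
    ]
    print(run_job(splits))
```

Because each split is mapped independently and each key is reduced independently, both steps can be spread across many machines, which is why this model suits large batch jobs over logs or historical records.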
Category: data
Difficulty: advanced
Related Terms
See Also