Hadoop

Definition

Open-source framework for storing and processing massive datasets across clusters of computers, facilitating big data analytics and management.

Use Cases

Provider Equivalents

Frequently Asked Questions

What's the difference between Hadoop and Spark?
Hadoop is a broader ecosystem for distributed storage and batch processing, especially using HDFS and MapReduce. Spark is a faster distributed processing engine that often runs on Hadoop-compatible storage. In simple terms, Hadoop is commonly associated with storing huge datasets and processing them in batches, while Spark is often used when teams need faster analytics, iterative machine learning, or near-real-time processing.
When should I use Hadoop?
Use Hadoop when you need to store and process very large datasets that do not fit well on one machine, especially for batch analytics over logs, transactions, clickstreams, sensor data, or historical records. It is most useful when data volume is measured in terabytes or petabytes and parallel processing can reduce processing time. For smaller datasets, real-time applications, or simple reporting, a cloud data warehouse, relational database, or managed analytics service may be easier.
How much does Hadoop cost?
Hadoop itself is open source, so there is no software license cost for the core Apache Hadoop project. The main costs are compute servers or cloud instances, storage, networking, cluster management, monitoring, backups, security, and skilled administrators. Managed services such as Amazon EMR, Azure HDInsight, Google Cloud Dataproc, and OCI Big Data Service charge for the underlying compute and storage resources, and may also include service-specific management fees.

Category: data

Difficulty: advanced

Related Terms

See Also