Dataproc

Definition

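Google Cloud's managed service for running Apache Spark, Hadoop, and related open-source engines for big data processing. It handles cluster provisioning, configuration, and scaling, so teams can concentrate on their jobs rather than on cluster administration.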

Use Cases

Provider Equivalents

Frequently Asked Questions

What's the difference between Dataproc and BigQuery?
Dataproc runs open-source engines like Spark and Hadoop on managed clusters, so you write Spark jobs (Scala/PySpark) or run Hadoop ecosystem tools. BigQuery is a serverless data warehouse where you query data with SQL without managing clusters. Use Dataproc when you need Spark/Hadoop processing (custom code, specific libraries, or Hadoop tools). Use BigQuery when SQL analytics and managed warehousing are the priority.
When should I use Dataproc?
Use Dataproc when you need managed Spark/Hadoop clusters for batch ETL, machine learning feature engineering with Spark, log processing, or migrating existing on-prem Hadoop/Spark jobs to Google Cloud. It’s a good fit when you want control over cluster configuration, need Hadoop ecosystem components, or want to use Spark libraries that aren’t available in serverless SQL tools.
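For the batch ETL case, a run on Dataproc is often a three-step sequence: create a cluster, submit the job, delete the cluster. The sketch below only builds the corresponding gcloud commands as argument lists and does not execute them. The cluster name, region, bucket path, and worker count are hypothetical placeholders, and the helper function itself is illustrative rather than part of any Google library.

```python
import shlex


def batch_job_commands(cluster, region, job_file, num_workers=2):
    """Return the gcloud commands for a create -> submit -> delete run.

    All argument values are supplied by the caller; nothing is executed here.
    """
    create = [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}",
        f"--num-workers={num_workers}",
    ]
    submit = [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", job_file,
        f"--cluster={cluster}",
        f"--region={region}",
    ]
    delete = [
        "gcloud", "dataproc", "clusters", "delete", cluster,
        f"--region={region}",
        "--quiet",  # skip the interactive confirmation prompt
    ]
    return [create, submit, delete]


# Hypothetical values, for illustration only.
commands = batch_job_commands(
    cluster="etl-nightly",
    region="us-central1",
    job_file="gs://example-bucket/jobs/etl.py",
)
for cmd in commands:
    print(shlex.join(cmd))
```

Keeping the commands as lists makes them easy to hand to a scheduler or to subprocess.run once the placeholder values are replaced with real ones.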
How much does Dataproc cost?
Dataproc pricing is mainly based on the underlying compute (VMs), storage, and networking you use, plus a Dataproc service fee that accrues for as long as the cluster runs. Costs depend on cluster size (number and type of VMs), how long the cluster runs, whether you use autoscaling, and whether you use Spot (formerly preemptible) VMs. A common cost-control approach is to use ephemeral clusters (create for a job, then delete) and store data in Cloud Storage instead of HDFS, so the data outlives the cluster and you stop paying for compute as soon as the job finishes.
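The cost drivers above can be turned into a rough back-of-the-envelope estimate. The sketch below assumes the service fee is billed per vCPU-hour and ignores storage and networking. Both rates are made-up placeholders, not actual Google Cloud prices; check the current pricing page before relying on any number.

```python
def estimate_cluster_cost(num_vms, vcpus_per_vm, hours,
                          vm_hourly_rate, fee_per_vcpu_hour):
    """Rough cost of one cluster run: VM compute plus Dataproc service fee."""
    compute = num_vms * hours * vm_hourly_rate
    service_fee = num_vms * vcpus_per_vm * hours * fee_per_vcpu_hour
    return compute + service_fee


# Placeholder rates (assumptions, NOT real prices).
VM_RATE = 0.20   # currency units per VM-hour
FEE_RATE = 0.01  # currency units per vCPU-hour

# Same 3-VM, 4-vCPU cluster over 30 days, used two ways:
always_on = estimate_cluster_cost(3, 4, 24 * 30, VM_RATE, FEE_RATE)
ephemeral = estimate_cluster_cost(3, 4, 2 * 30, VM_RATE, FEE_RATE)  # 2 h/day

print(f"always-on: {always_on:.2f}")
print(f"ephemeral: {ephemeral:.2f}")
```

Because both terms scale linearly with running hours, a cluster that exists only for a two-hour daily job costs one twelfth of the same cluster left running all day under these assumptions.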

Category: data

Difficulty: advanced

Related Terms

See Also