Question 1

What's the difference between Dataproc and BigQuery?

Accepted Answer

Dataproc runs open-source engines like Spark and Hadoop on managed clusters, so you write Spark jobs (Scala/PySpark) or run Hadoop ecosystem tools. BigQuery is a serverless data warehouse where you query data with SQL without managing clusters. Use Dataproc when you need Spark/Hadoop processing (custom code, specific libraries, or Hadoop tools). Use BigQuery when SQL analytics and managed warehousing are the priority.

Question 2

When should I use Dataproc?

Accepted Answer

Use Dataproc when you need managed Spark/Hadoop clusters for batch ETL, machine learning feature engineering with Spark, log processing, or migrating existing on-prem Hadoop/Spark jobs to Google Cloud. It’s a good fit when you want control over cluster configuration, need Hadoop ecosystem components, or want to use Spark libraries that aren’t available in serverless SQL tools.

Question 3

How much does Dataproc cost?

Accepted Answer

Dataproc costs come mainly from the underlying compute (VMs), storage, and networking you use, plus a Dataproc service fee billed per vCPU for the time the cluster runs. The total depends on cluster size (number and type of VMs), how long the cluster runs, whether you use autoscaling, and whether you use preemptible/spot VMs. A common cost-control approach is to use ephemeral clusters (create one for a job, then delete it) and to store data in Cloud Storage instead of HDFS, so the data outlives the cluster.
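
To make the cost drivers above concrete, here is a rough back-of-the-envelope sketch in Python. The rates are assumptions for illustration only: the VM price and machine shape are hypothetical, and the per-vCPU service fee should be checked against the current pricing page for your region. Storage and networking are left out.

```python
# Illustrative rates only; verify against current pricing before relying on them.
DATAPROC_FEE_PER_VCPU_HOUR = 0.010  # assumed Dataproc service fee (USD per vCPU-hour)
VM_PRICE_PER_HOUR = 0.19            # hypothetical price of one 4-vCPU VM (USD per hour)
VCPUS_PER_VM = 4                    # hypothetical machine shape


def estimate_cluster_cost(num_vms: int, hours: float,
                          vm_price_per_hour: float = VM_PRICE_PER_HOUR,
                          vcpus_per_vm: int = VCPUS_PER_VM,
                          fee_per_vcpu_hour: float = DATAPROC_FEE_PER_VCPU_HOUR) -> float:
    """Estimate compute plus service-fee cost in USD (storage and network excluded)."""
    compute = num_vms * vm_price_per_hour * hours
    service_fee = num_vms * vcpus_per_vm * fee_per_vcpu_hour * hours
    return round(compute + service_fee, 2)


# Same 5-VM cluster: left running for a 30-day month vs. created for a 2-hour job each day.
always_on = estimate_cluster_cost(num_vms=5, hours=24 * 30)
ephemeral = estimate_cluster_cost(num_vms=5, hours=2 * 30)
print(f"always-on: ${always_on:.2f}  ephemeral: ${ephemeral:.2f}")
```

Because both the VM charge and the service fee scale with running time, the ephemeral pattern cuts the bill roughly in proportion to the hours saved, which is why deleting idle clusters is usually the first thing to try.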
