Question 1

What's the difference between Cloud Dataflow and Cloud Dataproc?

Accepted Answer

Cloud Dataflow is a fully managed service for running Apache Beam pipelines (batch and streaming); it autoscales workers and leaves you with little or no cluster management. Cloud Dataproc is a managed Hadoop/Spark service in which you still create and size clusters and then submit jobs (Spark, Hive, etc.) to them. Use Dataflow when you want a serverless, Beam-based pipeline; use Dataproc when you need tools from the Spark/Hadoop ecosystem or cluster-level control.

Question 2

When should I use Cloud Dataflow?

Accepted Answer

Use Cloud Dataflow when you need to build reliable data pipelines that handle streaming (near real time) and/or batch processing, especially if you want autoscaling, managed operations, and Apache Beam portability. Common scenarios include streaming ETL from Pub/Sub to BigQuery, sessionization and windowed aggregations, log/event processing, and batch transformations for data warehouse loading.
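To make "windowed aggregation" concrete, here is a minimal plain-Python sketch of the idea: events are grouped into fixed, non-overlapping time windows and counted per key. It does not use the Apache Beam SDK; the function name `fixed_window_counts` and the sample events are assumptions made up for illustration. In a real Beam pipeline the same grouping would be expressed with the SDK's windowing and combine transforms, and Dataflow would run it across workers.

```python
from collections import defaultdict


def fixed_window_counts(events, window_seconds):
    """Count events per (window_start, key) using fixed time windows.

    `events` is an iterable of (timestamp_seconds, key) pairs.
    This is an illustrative helper, not part of any Dataflow or Beam API.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Align the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical event stream: (seconds since start, event type).
events = [(0, "login"), (15, "click"), (59, "click"), (60, "click"), (125, "login")]
result = fixed_window_counts(events, window_seconds=60)

for (window_start, key), count in sorted(result.items()):
    print(f"window starting at {window_start}s, {key}: {count}")
```

The sketch processes a finite list in memory. A streaming service also has to handle unbounded input and late-arriving events, which is the part a managed runner takes care of.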

Question 3

How much does Cloud Dataflow cost?

Accepted Answer

Costs depend mainly on the worker resources your jobs consume (CPU and memory, multiplied by how long the workers run), plus any charges from connected services (for example, Pub/Sub, BigQuery, Cloud Storage, and network egress). Streaming jobs often run continuously, so steady-state worker count and job uptime are the key cost drivers. Batch job costs typically depend on how long the job runs and how many workers it uses.
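The arithmetic behind this can be sketched in a few lines of Python. The rates below are placeholder numbers chosen only to show the calculation; they are not actual Dataflow prices, and the names `WorkerRates` and `estimate_worker_cost` are hypothetical. The sketch covers worker CPU and memory only and ignores charges from connected services.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerRates:
    """Hypothetical hourly rates in USD. Placeholder values, NOT real prices."""
    vcpu_hour: float
    gb_memory_hour: float


def estimate_worker_cost(workers, vcpus_per_worker, memory_gb_per_worker, hours, rates):
    """Rough worker cost: resources x workers x hours x assumed rate."""
    vcpu_cost = workers * vcpus_per_worker * hours * rates.vcpu_hour
    memory_cost = workers * memory_gb_per_worker * hours * rates.gb_memory_hour
    return vcpu_cost + memory_cost


# Assumed rates for illustration only.
rates = WorkerRates(vcpu_hour=0.06, gb_memory_hour=0.004)

# A batch job: 10 workers (4 vCPU, 16 GB each) for 2 hours.
batch = estimate_worker_cost(10, 4, 16, 2, rates)

# A streaming job: 3 such workers running continuously for 30 days.
streaming = estimate_worker_cost(3, 4, 16, 24 * 30, rates)

print(f"batch: ${batch:.2f}, streaming: ${streaming:.2f}")
```

Even with fewer workers, the streaming job costs far more in this example because it never stops, which is why uptime and steady-state worker count dominate streaming budgets.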
