Cloud Dataflow

Definition

Google Cloud's fully managed service for running Apache Beam pipelines in both streaming and batch modes, used for real-time analytics and data transformation across sources such as Pub/Sub, Cloud Storage, and BigQuery.

Use Cases

Provider Equivalents

Frequently Asked Questions

What's the difference between Cloud Dataflow and Cloud Dataproc?
Cloud Dataflow is a fully managed service for running Apache Beam pipelines (batch and streaming) with autoscaling and minimal cluster management. Cloud Dataproc is a managed Hadoop/Spark cluster service where you still think in terms of clusters and jobs (Spark, Hive, etc.). Use Dataflow when you want a serverless Beam-based pipeline; use Dataproc when you need Spark/Hadoop ecosystem tools or cluster-level control.
When should I use Cloud Dataflow?
Use Cloud Dataflow when you need to build reliable data pipelines that handle streaming (near real time) and/or batch processing, especially if you want autoscaling, managed operations, and Apache Beam portability. Common scenarios include streaming ETL from Pub/Sub to BigQuery, sessionization and windowed aggregations, log/event processing, and batch transformations for data warehouse loading.
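To make "windowed aggregations" concrete, here is a minimal plain-Python sketch of fixed (tumbling) windows. It does not use the Apache Beam or Dataflow APIs. The function name `fixed_window_counts`, the 60-second window size, and the sample events are assumptions made for illustration only.

```python
from collections import Counter


def fixed_window_counts(events, window_seconds=60):
    """Count events per (window_start, key) using fixed, non-overlapping windows.

    events: iterable of (timestamp_seconds, key) pairs (hypothetical input shape).
    """
    counts = Counter()
    for timestamp, key in events:
        # Align the timestamp down to the start of the window that contains it.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical event stream: (event time in seconds, event type).
events = [(5, "login"), (42, "login"), (61, "click"), (119, "login"), (120, "click")]
result = fixed_window_counts(events)
```

In a real pipeline the runner would also have to handle late and out-of-order data and decide when each window's result is emitted, which this sketch ignores.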
How much does Cloud Dataflow cost?
Costs mainly depend on the worker resources your jobs consume (vCPU, memory, and persistent disk, billed for the time workers run), plus any charges from connected services (for example Pub/Sub, BigQuery, Cloud Storage, and network egress). Streaming jobs often run continuously, so steady-state worker count and job uptime are the key cost drivers. Batch job cost typically scales with how long the job runs and how many workers it uses.
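The relationship between worker count, worker size, and uptime can be sketched as simple arithmetic. The function below is a rough estimate only. The name `estimate_worker_cost` and every rate in it are hypothetical placeholders, not published Dataflow prices, and the estimate leaves out disk, data-processing, and connected-service charges.

```python
def estimate_worker_cost(workers, vcpus_per_worker, memory_gb_per_worker,
                         hours, vcpu_rate_per_hour, memory_rate_per_gb_hour):
    """Rough worker cost: worker-hours times the hourly cost of one worker.

    All rates are caller-supplied placeholders; look up current pricing
    for your region before relying on any number this returns.
    """
    per_worker_hour = (vcpus_per_worker * vcpu_rate_per_hour
                       + memory_gb_per_worker * memory_rate_per_gb_hour)
    return workers * hours * per_worker_hour


# Placeholder rates chosen only to make the arithmetic easy to follow.
VCPU_RATE = 0.05     # hypothetical cost per vCPU-hour
MEMORY_RATE = 0.005  # hypothetical cost per GB-hour

# Streaming job: 3 workers running continuously for a 30-day month (720 hours).
streaming_cost = estimate_worker_cost(3, 4, 15, 720, VCPU_RATE, MEMORY_RATE)

# Batch job: 10 workers of the same size running for 2 hours.
batch_cost = estimate_worker_cost(10, 4, 15, 2, VCPU_RATE, MEMORY_RATE)
```

With identical worker sizes, the always-on streaming job accumulates 2,160 worker-hours against the batch job's 20, which is why uptime dominates streaming costs.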

Category: data

Difficulty: advanced

Related Terms

See Also