Enterprise analytics and business intelligence

Data Lake & Analytics Platform

Modern Data Stack

A modern data lake architecture on GCP separates storage from compute using Cloud Storage and BigQuery. This design uses a medallion architecture (raw → curated → aggregated) with Dataflow for streaming and batch ETL, BigQuery for serverless SQL analytics, and Pub/Sub for real-time event ingestion. Built for data engineering teams centralizing analytics from multiple sources into a governed, query-ready data platform.
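The medallion layout can be made concrete as an object-naming convention in Cloud Storage. This is a minimal sketch under assumed conventions; the bucket names, source names, and date-partitioned path scheme are illustrative, not part of the original design.

```python
from datetime import datetime, timezone

# Hypothetical bucket per medallion layer; names are illustrative.
LAYER_BUCKETS = {
    "raw": "analytics-raw",
    "curated": "analytics-curated",
    "aggregated": "analytics-aggregated",
}

def object_path(layer: str, source: str, event_time: datetime) -> str:
    """Build a date/hour-partitioned object path for a medallion layer."""
    bucket = LAYER_BUCKETS[layer]
    return (
        f"gs://{bucket}/{source}/"
        f"dt={event_time:%Y-%m-%d}/hour={event_time:%H}/events.json"
    )

ts = datetime(2024, 5, 1, 13, tzinfo=timezone.utc)
print(object_path("raw", "orders", ts))
# gs://analytics-raw/orders/dt=2024-05-01/hour=13/events.json
```

Date and hour prefixes like these let downstream batch jobs (and BigQuery external tables) prune to the partitions they need instead of scanning the whole lake.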

Data Flow

Stream Ingestion → Stream Processor → Raw Data Lake → Dataflow ETL → Curated Layer → BigQuery Warehouse → API Layer, with Dataproc Spark running batch transformations over the raw lake.


Service Breakdown (8 services)

Stream Ingestion
  • Captures streaming events from multiple producers
  • Delivers messages to subscriber pipelines reliably
  • Supports ordering guarantees within partitions
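The per-partition ordering guarantee can be pictured as independent FIFO queues keyed by an ordering key. This is a conceptual stdlib sketch, not the Pub/Sub client API: messages sharing a key are delivered in publish order, while no order is promised across keys.

```python
from collections import defaultdict, deque

class OrderedTopic:
    """Toy model of per-key ordered delivery (one FIFO queue per key)."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def publish(self, ordering_key: str, message: str) -> None:
        self._queues[ordering_key].append(message)

    def pull(self, ordering_key: str) -> str:
        # Delivery preserves publish order within the key only.
        return self._queues[ordering_key].popleft()

topic = OrderedTopic()
topic.publish("user-1", "signup")
topic.publish("user-1", "purchase")
topic.publish("user-2", "signup")
print(topic.pull("user-1"))  # signup
print(topic.pull("user-1"))  # purchase
```

Choosing a key with enough cardinality (e.g. a user or entity ID) keeps ordering where it matters without serializing the whole topic.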
Stream Processor
  • Runs event-driven code without servers
  • Scales instantly from zero to peak load
  • Cost-effective for sporadic workloads
Raw Data Lake
  • Stores unprocessed data in its original format
  • Supports all file types for schema-on-read analytics
  • Serves as the single source of truth for raw events
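Schema-on-read means the lake stores events untouched and a schema is applied only when data is read. A minimal sketch, with illustrative field names, of projecting raw JSON lines onto a typed schema at read time:

```python
import json

# Raw events land as-is; a missing field is fine at ingest time.
RAW_LINES = [
    '{"user_id": "u1", "amount": "19.99", "country": "DE"}',
    '{"user_id": "u2", "amount": "5.00"}',
]

def read_with_schema(lines):
    """Project raw JSON records onto a typed schema at read time."""
    for line in lines:
        record = json.loads(line)
        yield {
            "user_id": record["user_id"],
            "amount": float(record["amount"]),       # cast on read
            "country": record.get("country", "unknown"),  # default on read
        }

rows = list(read_with_schema(RAW_LINES))
print(rows[1]["country"])  # unknown
```

The same raw objects can later be re-read under a different schema, which is exactly why the raw layer works as the single source of truth.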
Dataflow ETL
  • Runs Apache Beam pipelines for batch and stream
  • Auto-scales workers for pipeline throughput
  • Integrates with BigQuery and Pub/Sub
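The shape of such a pipeline can be shown without the Beam SDK. The sketch below mirrors the usual PTransform stages (a ParDo-style validate step, then a GroupByKey/Combine-style aggregation) in plain Python; the event fields are illustrative and this is not the Beam API itself.

```python
from collections import defaultdict

events = [
    {"country": "DE", "amount": 10.0},
    {"country": "DE", "amount": 5.0},
    {"country": "FR", "amount": 7.5},
    {"country": "DE", "amount": -1.0},  # invalid, filtered out
]

def run_pipeline(events):
    # "ParDo": validate/transform each element independently.
    valid = (e for e in events if e["amount"] >= 0)
    # "GroupByKey" + "Combine": aggregate per key.
    totals = defaultdict(float)
    for e in valid:
        totals[e["country"]] += e["amount"]
    return dict(totals)

print(run_pipeline(events))  # {'DE': 15.0, 'FR': 7.5}
```

Because each stage is element-wise or keyed, Dataflow can parallelize it across workers, which is what makes the auto-scaling in the next bullet possible.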
Curated Layer
  • Holds cleaned and transformed datasets for analysis
  • Applies data quality rules before promotion
  • Organized by domain for easy discovery and access
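Promotion gating can be expressed as a rule list a record must fully pass before moving from raw to curated. The rules and field names below are illustrative assumptions, not the platform's actual checks:

```python
# Each rule is (name, predicate); a record is promotable only if all pass.
RULES = [
    ("has_user_id", lambda r: bool(r.get("user_id"))),
    ("non_negative_amount", lambda r: r.get("amount", 0) >= 0),
    ("known_currency", lambda r: r.get("currency") in {"EUR", "USD"}),
]

def check(record):
    """Return the names of the rules a record fails (empty = promotable)."""
    return [name for name, rule in RULES if not rule(record)]

good = {"user_id": "u1", "amount": 9.5, "currency": "EUR"}
bad = {"user_id": "", "amount": -2, "currency": "GBP"}
print(check(good))  # []
print(check(bad))   # ['has_user_id', 'non_negative_amount', 'known_currency']
```

Records that fail can be routed to a quarantine prefix for inspection rather than silently dropped, keeping the curated layer trustworthy.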
BigQuery Warehouse
  • Runs complex analytical queries at petabyte scale
  • Supports real-time streaming inserts for freshness
  • Integrates with BI tools for dashboards and reports
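Since BigQuery's on-demand model bills per byte scanned, query cost is simple arithmetic. A back-of-the-envelope sketch; the per-TiB rate below is illustrative only, so check current pricing before relying on it:

```python
RATE_PER_TIB = 6.25  # USD per TiB scanned; illustrative, not current pricing

def query_cost(bytes_scanned: int) -> float:
    """Estimate on-demand cost from bytes scanned."""
    tib = bytes_scanned / 2**40
    return tib * RATE_PER_TIB

# A query scanning 500 GiB:
print(round(query_cost(500 * 2**30), 2))  # 3.05
```

This is also why partitioning and clustering tables matters: they shrink the bytes scanned, which directly shrinks the bill.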
API Layer
  • Serves curated data and query results through serverless endpoints
  • Scales instantly from zero to peak load
  • Cost-effective for sporadic workloads
Dataproc Spark
  • Runs Spark jobs for large-scale data transformation
  • Auto-scales worker nodes based on job complexity
  • Processes batch ETL pipelines on schedule

Scaling Strategy

Cloud Storage provides virtually unlimited storage that scales automatically. Pub/Sub handles real-time ingestion with automatic scaling. Dataflow pipelines auto-scale workers based on backlog. BigQuery runs serverlessly — you pay per query with automatic slot allocation. Dataproc clusters spin up on demand for Spark workloads and auto-scale based on YARN metrics.
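Backlog-driven autoscaling of the kind Dataflow uses can be modeled roughly: pick enough workers to keep up with the input rate and drain the current backlog within a target window. A simplified sketch with illustrative numbers, not Dataflow's actual algorithm:

```python
import math

def workers_needed(backlog_msgs, input_rate, per_worker_rate,
                   drain_seconds=60, max_workers=100):
    """Estimate worker count to absorb input and drain backlog in time."""
    required_rate = input_rate + backlog_msgs / drain_seconds
    return min(max_workers, max(1, math.ceil(required_rate / per_worker_rate)))

# 120k backlogged messages, 1k msg/s arriving, 500 msg/s per worker:
print(workers_needed(120_000, 1_000, 500))  # 6
```

The `max_workers` cap mirrors the ceiling you would set on a pipeline so a sudden backlog spike cannot scale costs without bound.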
