Enterprise analytics and business intelligence

Data Lake & Analytics Platform

Modern Data Stack

A modern data lake architecture on GCP separates storage from compute using Cloud Storage and BigQuery. This design uses a medallion architecture (raw → curated → aggregated) with Dataflow for streaming and batch ETL, BigQuery for serverless SQL analytics, and Pub/Sub for real-time event ingestion. Built for data engineering teams centralizing analytics from multiple sources into a governed, query-ready data platform.
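The medallion layout can be made concrete as an object-naming convention in Cloud Storage. This is a minimal sketch under assumed conventions; the bucket names, source names, and date-partitioned path scheme are illustrative, not part of the original design.

```python
from datetime import datetime, timezone

# Hypothetical bucket per medallion layer; names are illustrative.
LAYER_BUCKETS = {
    "raw": "analytics-raw",
    "curated": "analytics-curated",
    "aggregated": "analytics-aggregated",
}

def object_path(layer: str, source: str, event_time: datetime) -> str:
    """Build a date/hour-partitioned object path for a medallion layer."""
    bucket = LAYER_BUCKETS[layer]
    return (
        f"gs://{bucket}/{source}/"
        f"dt={event_time:%Y-%m-%d}/hour={event_time:%H}/events.json"
    )

ts = datetime(2024, 5, 1, 13, tzinfo=timezone.utc)
print(object_path("raw", "orders", ts))
# gs://analytics-raw/orders/dt=2024-05-01/hour=13/events.json
```

Date and hour prefixes like these let downstream batch jobs (and BigQuery external tables) prune to the partitions they need instead of scanning the whole lake.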

Data Flow

Stream Ingestion → Stream Processor → Raw Data Lake → Dataflow ETL → Curated Layer → BigQuery Warehouse → API Layer, with Dataproc Spark running batch transformations over the raw lake.


Service Breakdown (8 services)

Stream Ingestion
  • Captures streaming events from multiple producers
  • Delivers messages to subscriber pipelines reliably
  • Supports ordering guarantees within partitions
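The per-partition ordering guarantee can be pictured as independent FIFO queues keyed by an ordering key. This is a conceptual stdlib sketch, not the Pub/Sub client API: messages sharing a key are delivered in publish order, while no order is promised across keys.

```python
from collections import defaultdict, deque

class OrderedTopic:
    """Toy model of per-key ordered delivery (one FIFO queue per key)."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def publish(self, ordering_key: str, message: str) -> None:
        self._queues[ordering_key].append(message)

    def pull(self, ordering_key: str) -> str:
        # Delivery preserves publish order within the key only.
        return self._queues[ordering_key].popleft()

topic = OrderedTopic()
topic.publish("user-1", "signup")
topic.publish("user-1", "purchase")
topic.publish("user-2", "signup")
print(topic.pull("user-1"))  # signup
print(topic.pull("user-1"))  # purchase
```

Choosing a key with enough cardinality (e.g. a user or entity ID) keeps ordering where it matters without serializing the whole topic.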
Stream Processor
  • Runs event-driven code without servers
  • Scales instantly from zero to peak load
  • Cost-effective for sporadic workloads
Raw Data Lake
  • Stores unprocessed data in its original format
  • Supports all file types for schema-on-read analytics
  • Serves as the single source of truth for raw events
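Schema-on-read means the lake stores events untouched and a schema is applied only when data is read. A minimal sketch, with illustrative field names, of projecting raw JSON lines onto a typed schema at read time:

```python
import json

# Raw events land as-is; a missing field is fine at ingest time.
RAW_LINES = [
    '{"user_id": "u1", "amount": "19.99", "country": "DE"}',
    '{"user_id": "u2", "amount": "5.00"}',
]

def read_with_schema(lines):
    """Project raw JSON records onto a typed schema at read time."""
    for line in lines:
        record = json.loads(line)
        yield {
            "user_id": record["user_id"],
            "amount": float(record["amount"]),       # cast on read
            "country": record.get("country", "unknown"),  # default on read
        }

rows = list(read_with_schema(RAW_LINES))
print(rows[1]["country"])  # unknown
```

The same raw objects can later be re-read under a different schema, which is exactly why the raw layer works as the single source of truth.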
Dataflow ETL
  • Runs Apache Beam pipelines for batch and stream
  • Auto-scales workers for pipeline throughput
  • Integrates with BigQuery and Pub/Sub
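The shape of such a pipeline can be shown without the Beam SDK. The sketch below mirrors the usual PTransform stages (a ParDo-style validate step, then a GroupByKey/Combine-style aggregation) in plain Python; the event fields are illustrative and this is not the Beam API itself.

```python
from collections import defaultdict

events = [
    {"country": "DE", "amount": 10.0},
    {"country": "DE", "amount": 5.0},
    {"country": "FR", "amount": 7.5},
    {"country": "DE", "amount": -1.0},  # invalid, filtered out
]

def run_pipeline(events):
    # "ParDo": validate/transform each element independently.
    valid = (e for e in events if e["amount"] >= 0)
    # "GroupByKey" + "Combine": aggregate per key.
    totals = defaultdict(float)
    for e in valid:
        totals[e["country"]] += e["amount"]
    return dict(totals)

print(run_pipeline(events))  # {'DE': 15.0, 'FR': 7.5}
```

Because each stage is element-wise or keyed, Dataflow can parallelize it across workers, which is what makes the auto-scaling in the next bullet possible.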
Curated Layer
  • Holds cleaned and transformed datasets for analysis
  • Applies data quality rules before promotion
  • Organized by domain for easy discovery and access
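Promotion gating can be expressed as a rule list a record must fully pass before moving from raw to curated. The rules and field names below are illustrative assumptions, not the platform's actual checks:

```python
# Each rule is (name, predicate); a record is promotable only if all pass.
RULES = [
    ("has_user_id", lambda r: bool(r.get("user_id"))),
    ("non_negative_amount", lambda r: r.get("amount", 0) >= 0),
    ("known_currency", lambda r: r.get("currency") in {"EUR", "USD"}),
]

def check(record):
    """Return the names of the rules a record fails (empty = promotable)."""
    return [name for name, rule in RULES if not rule(record)]

good = {"user_id": "u1", "amount": 9.5, "currency": "EUR"}
bad = {"user_id": "", "amount": -2, "currency": "GBP"}
print(check(good))  # []
print(check(bad))   # ['has_user_id', 'non_negative_amount', 'known_currency']
```

Records that fail can be routed to a quarantine prefix for inspection rather than silently dropped, keeping the curated layer trustworthy.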
BigQuery Warehouse
  • Runs complex analytical queries at petabyte scale
  • Supports real-time streaming inserts for freshness
  • Integrates with BI tools for dashboards and reports
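Since BigQuery's on-demand model bills per byte scanned, query cost is simple arithmetic. A back-of-the-envelope sketch; the per-TiB rate below is illustrative only, so check current pricing before relying on it:

```python
RATE_PER_TIB = 6.25  # USD per TiB scanned; illustrative, not current pricing

def query_cost(bytes_scanned: int) -> float:
    """Estimate on-demand cost from bytes scanned."""
    tib = bytes_scanned / 2**40
    return tib * RATE_PER_TIB

# A query scanning 500 GiB:
print(round(query_cost(500 * 2**30), 2))  # 3.05
```

This is also why partitioning and clustering tables matters: they shrink the bytes scanned, which directly shrinks the bill.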
API Layer
  • Serves curated data and query results through serverless endpoints
  • Scales instantly from zero to peak load
  • Cost-effective for sporadic workloads
Dataproc Spark
  • Runs Spark jobs for large-scale data transformation
  • Auto-scales worker nodes based on job complexity
  • Processes batch ETL pipelines on schedule

Scaling Strategy

Cloud Storage provides virtually unlimited storage that scales automatically. Pub/Sub handles real-time ingestion with automatic scaling. Dataflow pipelines auto-scale workers based on backlog. BigQuery runs serverlessly — you pay per query with automatic slot allocation. Dataproc clusters spin up on demand for Spark workloads and auto-scale based on YARN metrics.
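Backlog-driven autoscaling of the kind Dataflow uses can be modeled roughly: pick enough workers to keep up with the input rate and drain the current backlog within a target window. A simplified sketch with illustrative numbers, not Dataflow's actual algorithm:

```python
import math

def workers_needed(backlog_msgs, input_rate, per_worker_rate,
                   drain_seconds=60, max_workers=100):
    """Estimate worker count to absorb input and drain backlog in time."""
    required_rate = input_rate + backlog_msgs / drain_seconds
    return min(max_workers, max(1, math.ceil(required_rate / per_worker_rate)))

# 120k backlogged messages, 1k msg/s arriving, 500 msg/s per worker:
print(workers_needed(120_000, 1_000, 500))  # 6
```

The `max_workers` cap mirrors the ceiling you would set on a pipeline so a sudden backlog spike cannot scale costs without bound.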
