Web Crawler System
System Design Classic
A web crawler systematically downloads and indexes web pages, forming the backbone of search engines. This GCP-native design covers URL frontier management (prioritizing which URLs to crawl next via Pub/Sub topics), politeness policies (respecting robots.txt and per-domain rate limits), content deduplication (detecting near-duplicate pages via SimHash fingerprints stored in Memorystore), and distributed coordination across hundreds of Cloud Run crawler instances. It suits search-engine teams building crawlers that must scale while respecting robots.txt, enforcing politeness delays, and deduplicating content.
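To make the SimHash-based deduplication concrete, here is a minimal sketch of how a fingerprint could be computed and compared. The tokenization (whitespace split), the MD5-derived per-token hash, and the Hamming-distance threshold of 3 are illustrative choices, not part of the design above; a production crawler would tune all three and store the fingerprints in Memorystore.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over a page's whitespace tokens."""
    vector = [0] * bits
    for token in text.lower().split():
        # Derive a stable 64-bit hash per token from truncated MD5.
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            # Each token votes +1/-1 on every bit position.
            vector[i] += 1 if (h >> i) & 1 else -1
    # The sign of each accumulated position becomes a fingerprint bit.
    return sum(1 << i for i in range(bits) if vector[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, threshold: int = 3) -> bool:
    """Pages whose fingerprints differ in few bits are near-duplicates."""
    return hamming_distance(a, b) <= threshold
```

Unlike a cryptographic hash, SimHash maps similar inputs to nearby fingerprints, so a crawler can detect boilerplate-heavy near-duplicates with a cheap integer comparison rather than a full-text diff.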
Crawler workers run on Cloud Run and scale automatically with Pub/Sub subscription backlog. Pub/Sub manages the URL frontier, using ordered topics for priority-based crawling. Memorystore holds the Bloom filter for URL deduplication and the per-domain rate-limit counters. Crawled content lands in Cloud Storage, while Firestore tracks crawl metadata. Dataflow extracts discovered URLs from fetched pages and feeds them back into the frontier, and workers honor per-domain politeness constraints even as they scale out.